1. INTRODUCTION

In this report, we present the results of our work in the
past three years on data compression and quality evaluation of
digital speech. The overall goal of our research has been to
develop and implement techniques for digitally transmitting high
quality speech at the lowest possible data rates. We have
developed these techniques for Linear Predictive Coding (LPC)
systems (also known as LPC vocoders). Also, they have been
designed for transmitting speech over packet-switched
communication media, an example of which is the ARPA Network;
these media handle data messages in a time-asynchronous fashion.
As a result, the data rate of our digital vocoder varies in time
in accordance with the properties of the incoming speech signal.
The variable transmission rate has a low upper bound as well as a
low average, an important consideration for a real-time
application such as transmission over the ARPA Network.

1.1 Summary of Major Results

Analysis Methods

We developed a new analysis method for linear prediction,
called the covariance lattice method. The method combines all the
desirable properties of the traditional autocorrelation and
covariance methods, and requires about the same computational
complexity as the other two. These properties are: (1) Windowing
of the signal is not required; (2) The resulting all-pole linear
prediction filter is guaranteed to be stable; (3) Stability is
less sensitive to finite wordlength computations; and (4)
Quantization of the lattice model parameters (for the purpose of
data compression) can be accomplished within the recursion for
retention of accuracy in representation.

We extended the lattice method to perform adaptive analysis
in the sense of providing new estimates for the lattice
parameters for every speech sample. Adaptive methods in general
offer several advantages over the above "block" analysis methods;
these include the option to choose which set of coefficient
estimates to transmit in a given segment of the signal, and
simpler hardware realization. In addition, our adaptive lattice
methods ensure filter stability, and possess a desirable
convergence property in that the convergence is almost
independent of the spectral dynamic range of the input signal.

Also, we developed a linear predictive spectral warping
technique to be included as part of the analyzer. This technique
makes more effective use of the bits needed to transmit spectral
information.
Parameter Quantization

We developed improved quantization schemes for LPC
parameters: log area ratios (LARs), pitch and gain. The scheme
for LARs employs unequal quantization step sizes for the
different coefficients, with the step sizes derived by taking
advantage of the differences in spectral sensitivity levels of
individual LARs. The pitch quantization scheme makes efficient
use of all the levels, in the sense that the decoded pitch values
corresponding to these levels are all distinct. As LPC gain
parameter, we found the energy of the speech signal to be a
desirable choice.

Perceptual  Model  of  Speech  and  Variable  Frame  Rate  Transmission 

We formulated and experimentally validated a functional
perceptual model of speech in which speech is represented, with
only a minimal loss in perceived quality, in terms of LPC
parameters extracted time-asynchronously at a minimum set of time
instances and in terms of linear parameter variation over the
interval between these time instances. Based on this model, we
developed new variable frame rate (VFR) transmission schemes for
LARs, pitch and gain. We applied these VFR compression schemes
to a 100 frames/sec fixed-rate LPC vocoder with a bit rate of
about 5700 bps (bits/sec) to produce a variable rate vocoder with
an average bit rate of only about 2100 bps for continuous speech
and with approximately the same speech quality as the fixed-rate
system. Use of Huffman coding and variable order linear
prediction (two of a number of techniques that we developed under
a previous ARPA project [1]) would further lower the average bit
rate to about 1500 bps, with no change in perceived speech
quality.
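
The idea of transmitting parameters only at selected time instances and
interpolating linearly in between can be illustrated with a minimal
sketch. It is not the VFR scheme of this report (which is described in
Section 4); the greedy selection rule, the function names and the
tolerance parameter are all assumptions made for illustration.

```python
def _fits(track, i, j, tol):
    """True if linear interpolation between frames i and j stays within tol."""
    for t in range(i + 1, j):
        interp = track[i] + (track[j] - track[i]) * (t - i) / (j - i)
        if abs(track[t] - interp) > tol:
            return False
    return True


def select_frames(track, tol):
    """Greedy choice of frame indices to transmit (hypothetical rule).

    track: one parameter value per analysis frame (e.g. one LAR in dB).
    tol:   largest interpolation error tolerated at a skipped frame.
    """
    n = len(track)
    sent = [0]
    start = 0
    while start < n - 1:
        end = start + 1
        # Extend the segment while the skipped frames are still well
        # approximated by a straight line between its end points.
        while end + 1 < n and _fits(track, start, end + 1, tol):
            end += 1
        sent.append(end)
        start = end
    return sent


def reconstruct(track, sent):
    """Rebuild every frame by linear interpolation between sent frames."""
    out = []
    for a, b in zip(sent, sent[1:]):
        for t in range(a, b):
            out.append(track[a] + (track[b] - track[a]) * (t - a) / (b - a))
    out.append(track[sent[-1]])
    return out
```

With a track that is piecewise linear, only the breakpoints are selected,
so the number of transmitted frames follows the signal rather than a
fixed clock.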

A Mixed-Source Model to Improve Speech Synthesis

With the objective of enhancing the naturalness of the
synthesized speech, we developed a new model for generating the
excitation signal for the LPC synthesizer. In contrast to the
traditional idealized pulse/noise (or voiced/unvoiced) source
model, the new model mixes the pulse and noise excitations. The
mix is achieved by dividing the speech spectrum into two regions,
with the pulse source exciting the low-frequency region and the
noise source exciting the high-frequency region. The cutoff
frequency that separates the two regions is adaptively varied in
accordance with the changing speech signal. Experiments using
the new model indicated its power in synthesizing natural
sounding voiced fricatives, and in largely eliminating the
"buzzy" quality of vocoded speech.

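The two-region mix described above can be sketched in the frequency
domain. This is only an illustration under stated assumptions: a
brick-wall split at a hypothetical DFT bin `cutoff_bin`, unit-amplitude
pulses, unit-variance Gaussian noise, and no gain matching between the
two sources. The report's own implementation is described in Section 6.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]


def mixed_excitation(n, pitch_period, cutoff_bin, seed=0):
    """Pulse excitation below cutoff_bin, noise excitation above it."""
    rng = random.Random(seed)
    pulses = [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    pulse_spec, noise_spec = dft(pulses), dft(noise)
    mixed = []
    for f in range(n):
        # Fold the bin index so that positive and negative frequencies
        # are treated alike and the output signal stays real.
        folded = min(f, n - f)
        mixed.append(pulse_spec[f] if folded <= cutoff_bin else noise_spec[f])
    return idft(mixed)
```

Setting `cutoff_bin` to `n // 2` reproduces the pure pulse source, and a
negative value gives the pure noise source, so the traditional
pulse/noise model appears as the two extremes of this sketch.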

BBN  Report  No.  3794 


Bolt Beranek and Newman Inc.


Subjective  Speech  Quality  Evaluation 

We  developed  and  tested  an  improved  method  for  measuring 
subjective  speech  quality.  The  method  uses  a set  of  six 
specially  designed  sentences,  each  read  by  six  talkers.  The 
material  is  both  representative,  in  that  it  covers  a wide  range 
of  speech  events  and  talker  characteristics,  and  also 
challenging, in that some speech material is included that would
fully  extend  any  LPC  vocoder's  abilities.  Applying  this  method, 
we  obtained  several  practical  results.  For  example,  by  studying 
speech  quality  as  a function  of  vocoder  parameters,  we  derived 
tradeoff  relations  to  define  the  combination  of  vocoder 
parameters  yielding  the  best  quality  for  any  desired  overall  bit 
rate.  In  another  test,  we  showed  that  variable  frame  rate 
transmission  techniques  can  produce  the  highest  quality  at  any 
given rate, compared to two other methods which controlled the
bit  rate  by  adjusting  the  LPC  order  or  by  varying  the  log  area 
ratio  quantization  step  size.  Also,  we  formally  demonstrated  the 
effectiveness  of  our  perceptual-model-based  VFR  scheme  and  its 
superiority  to  our  earlier  log-likelihood  ratio  VFR  scheme.  In 
addition,  we  generated  subjective  speech  quality  data  which  we 
then  used  as  a baseline  for  correlating  against  results  obtained 
from  our  objective  methods  of  speech  quality  assessment. 

As part of our subjective speech quality work, we also
investigated a few other topics including: (1) a phoneme-specific
intelligibility test, using nonsense materials; (2) the effect of
lost packets on the intelligibility of speech transmitted over
ARPANET; and (3) development of a method to reduce stimulus
sequence effects on listeners' judgments.

Objective  Speech  Quality  Evaluation 

We  formulated  a general  framework  for  objective  speech 
quality  evaluation  of  narrowband  LPC  vocoders.  Within  this 
framework, we developed several objective methods. In each
method, the error in short-term spectral behavior between vocoded
speech and the original is computed once every 10 ms. These
errors  are  appropriately  weighted  and  averaged  over  an  utterance 
to  produce  a single  objective  score.  We  evaluated  the  objective 
methods  by  correlating  the  resulting  objective  scores  with  formal 
subjective  speech  quality  judgments.  The  usefulness  of  our 
methods  was  clearly  indicated  by  the  high  correlations  that  we 
obtained. 
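
The structure of such an objective measure (a spectral error per 10 ms
frame, then a weighted average over the utterance) can be sketched as
follows. The RMS log-spectral distance used here is an assumption chosen
for illustration; the distance measures and weightings actually studied
are described in Section 8.

```python
import math


def log_spectral_error_db(spec_ref, spec_test):
    """RMS difference in dB between two power spectra of equal length."""
    diffs = [10.0 * math.log10(a) - 10.0 * math.log10(b)
             for a, b in zip(spec_ref, spec_test)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def objective_score(frames_ref, frames_test, weights=None):
    """Weighted average of per-frame spectral errors over an utterance.

    frames_ref, frames_test: lists of power spectra, one per 10 ms frame.
    weights: optional per-frame weights (uniform if omitted).
    """
    errors = [log_spectral_error_db(r, t)
              for r, t in zip(frames_ref, frames_test)]
    if weights is None:
        weights = [1.0] * len(errors)
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)
```

A lower score means the vocoded spectra track the original more closely;
the weights allow, for example, louder frames to count more than quiet
ones.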

Real-Time  Implementation 

The current BBN speech facility has evolved mostly during
the last three years. Briefly, it consists of the following: the
SPS-41 computer with a dual-port memory interface and a dual
channel A/D and D/A converter system; the PDP-11/40 computer
with an RT11 operating system, an IMP11A interface to provide a
link to the ARPA Network, the IMLAC PDS-1 display minicomputer as
a peripheral to the PDP11; and a software package which includes
an FTP (File Transfer Protocol), a real-time speech acquisition,
waveform display and editing program, and a convenient
interactive playback program.

We cooperated with the other sites in the ARPA community in
implementing an LPC vocoder that transmits speech over the ARPA
Network in real time. Also, we provided specifications for the
ARPA LPC-II system, the first real-time variable-rate speech
compression system on the ARPANET.

1.2  Outline  of  Report 

Before we outline the contents of Sections 2-11, we note
that the results of our work on various topics have been
previously reported in the form of conference or journal papers
and ARPA Network Speech Compression (NSC) notes. We describe
these results briefly in the main sections of the report, and
include these papers as appendices. Of course, topics that we
have not previously reported, or on which additional work has
been performed since the previous reporting, are dealt with in a
detailed manner.

In Section 2, we describe three analysis methods: covariance
lattice, adaptive lattice and linear predictive warping.
Section 3 contains the description of improved quantization
schemes for LARs and pitch. Also considered in this section is
the question of which of the two candidates for LPC gain
parameter, speech signal energy and linear prediction error
signal energy, produces a smaller quantization error.

Section  4 describes  in  detail  our  new  variable  frame  rate 
transmission  schemes  for  LARs,  pitch  and  gain.  First,  we  briefly 
review  our  work  on  VFR  transmission  performed  on  a previous  ARPA 
project  [1].  Then,  we  state  our  perceptual  model  of  speech,  and 
indicate  a major  difference  between  the  previous  VFR  scheme  and 
the  new  VFR  scheme  based  on  this  perceptual  model.  Next,  the 
various  features  of  the  new  VFR  scheme  for  LAR  transmission  are 
described  at  length,  followed  by  the  experimental  results  of 
comparisons  of  the  speech  quality  of  an  LPC  vocoder  which 
transmitted  LARs  at  a variable  rate  using  this  new  scheme  but 
pitch  and  gain  at  a fixed  rate,  with  the  speech  quality  of 
several  other  fixed-rate  and  variable-rate  vocoders.  Next,  to 
substantially  reduce  the  computational  burden,  we  propose  a 
simplified  VFR  scheme  for  LAR  transmission.  Finally,  two  types 
of  VFR  schemes  for  the  transmission  of  pitch  and  gain  are 
presented. 

In Section 5, we consider our work on three issues related
to the operation of the LPC synthesizer. These are: optimal
linear interpolation of synthesizer parameters, implementation of
synthesizer gain, and all-pass excitation.
Section 6 deals with our mixed-source model: an automatic
scheme to extract the model parameter (cutoff frequency),
implementation of the model at the synthesizer, and the effect on
vocoded speech due to the use of the model.


Our work on subjective speech quality evaluation is
presented in detail in Section 7. First, we describe the
development and testing of a subjective quality measurement
procedure. The results obtained by applying this procedure to
three practical problems are given next. The section ends with
discussions on several miscellaneous topics in the subjective
quality evaluation area that we worked on as part of this
project.


Section  8 deals  with  our  efforts  on  the  task  of  objective 
speech  quality  evaluation.  The  section  starts  with  a statement 
of  a general  framework  that  we  used  in  dealing  with  this  task. 
Next,  several  distance  measures  are  described  for  computing  the 
error  in  short-term  spectral  behavior  between  vocoded  speech  and 
the original. Methods for time-weighting and time-averaging the
computed  frame  spectral  errors  over  an  utterance  are  considered. 
Finally,  the  results  obtained  by  comparing  objective  speech 
quality  scores  against  subjective  judgments  are  presented. 

In Section 9, we describe our work towards developing a
real-time speech facility at BBN. Also, we briefly summarize the
specifications that we provided for the ARPA LPC-II speech
compression system.

Two additional topics that we have also worked on during
this project are considered in Section 9. These are:
Differential Pulse Code Modulation (DPCM) coding of LPC
parameters, and linear predictive formant vocoder.
2. ANALYSIS METHODS

A number  of  new  analysis  methods  have  been  developed,  some 
of  which  promise  to  have  a major  impact  in  various  estimation  and 
modelling  applications.  The  first  two  sections  describe  our 
contributions  to  the  area  of  lattice  methods  in  linear  prediction 
analysis.  The  last  section  presents  the  method  of  linear 
predictive  spectral  warping,  which  makes  more  effective  use  of 
the  bits  needed  to  transmit  spectral  information. 

2.1  Covariance  Lattice  Methods 

The  autocorrelation  method  of  linear  prediction  guarantees 
the  stability  of  the  all-pole  filter,  but  has  the  disadvantage 
that  windowing  of  the  speech  signal  causes  some  unwanted 
distortion  in  the  spectrum.  In  practice,  even  the  stability  is 
not always guaranteed with finite wordlength (FWL) computations.
On  the  other  hand,  the  covariance  method  does  not  guarantee  the 
stability  of  the  filter  even  with  floating-point  computation,  but 
it  has  the  advantage  that  there  is  no  windowing  and  hence  no 
unnecessary  distortion  of  the  signal  spectrum.  To  combine  the 
advantages  of  these  two  methods,  we  developed  a new  formulation 
for  linear  prediction,  which  we  call  the  covariance  lattice 
method (see Appendix 1 for details). The method is one of a
class of lattice methods which guarantee the stability of the
all-pole linear prediction filter, with or without windowing of
the signal, and with the number of computations being comparable
to the autocorrelation and covariance methods; also, stability is
less sensitive to FWL computations.

We incorporated the covariance lattice method into our
floating-point simulation of the LPC speech compression system.
This also involved "tuning" of such quantities as the analysis
interval and the criterion for determining optimal LPC order.
(The latter is required when variable order linear prediction is
used [1].) The result was approximately the same speech quality
as that from our earlier 1500 bps LPC system [1] (which used the
autocorrelation method) at about the same total computation time.
In fixed-point implementations, however, the lower sensitivity of
filter stability to FWL computations provided by the covariance
lattice method is expected to lead to an improvement in speech
quality relative to that from the autocorrelation LPC system.
Furthermore, the covariance lattice method permits the
coefficients to be quantized within the recursion, thus
integrating quantization into the coefficient estimation process;
this is expected to improve the accuracy of the estimated
short-term speech spectrum, and hence to improve the quality of
the synthesized speech. (In non-lattice methods, quantization is
done only after completing coefficient estimation.) However, one
of the major benefits of lattice methods is expected to be in
simpler hardware realizations.
2.2 Adaptive Lattice Methods

Covariance lattice methods are appropriate for "block"
analysis of speech, whereby the speech is analyzed a frame at a
time. However, for certain hardware realizations, it might be
simpler to perform an adaptive type of analysis, which
continuously updates the values of the reflection coefficients in
the lattice. This has the advantage that one can choose which
set of coefficients to transmit in a particular speech interval.
Having such a choice might be important in obtaining consistent
spectral estimates that are not as affected by the quasi-periodic
nature of voiced speech as are the regular block estimation
methods, such as the autocorrelation and covariance methods.

We have recently developed the theoretical basis for
adaptive lattice estimation (see Appendices 2 and 3 for details).
Although the methods have not been tested out thoroughly for
speech, it is expected that they would give positive results.
One of the major properties of adaptive lattice methods is that
the convergence to the optimal values is almost independent of
the spectral dynamic range of the input signal (i.e., independent
of the eigenvalue spread of the signal covariance matrix). This
property, absent in many previous adaptive methods, promises to
have wide-ranging applications in communication systems, wherever
adaptive transversal filters are used.
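
A sample-by-sample lattice update of this general kind can be sketched
as follows. The sketch uses one well-known form, a Burg-type
(harmonic-mean) estimate with exponential forgetting; it is an assumed
stand-in, and the algorithms actually derived in Appendices 2 and 3 may
differ. The forgetting factor `beta` and the helper `ar1_signal` are
hypothetical names introduced here.

```python
import random


def adaptive_lattice(signal, order, beta=0.995):
    """Update reflection coefficients at every sample of the input.

    Each stage keeps exponentially weighted running sums
        num = sum of 2 * f * b      and      den = sum of f^2 + b^2,
    and sets k = -num / den. Since |2 f b| <= f^2 + b^2 term by term,
    every |k| stays at or below 1, which keeps the all-pole filter stable.
    """
    k = [0.0] * order
    num = [0.0] * order
    den = [1e-9] * order            # tiny start value avoids division by zero
    b_prev = [0.0] * (order + 1)    # backward errors from the previous sample
    for x in signal:
        f = x                       # stage-0 forward error
        b_cur = [0.0] * (order + 1)
        b_cur[0] = x                # stage-0 backward error
        for m in range(order):
            fm, bm = f, b_prev[m]
            num[m] = beta * num[m] + 2.0 * fm * bm
            den[m] = beta * den[m] + fm * fm + bm * bm
            k[m] = -num[m] / den[m]
            f = fm + k[m] * bm              # forward error of next stage
            b_cur[m + 1] = bm + k[m] * fm   # backward error of next stage
        b_prev = b_cur
    return k


def ar1_signal(n, pole=0.9, seed=1):
    """Test signal x[t] = pole * x[t-1] + white Gaussian noise."""
    rng = random.Random(seed)
    out, prev = [], 0.0
    for _ in range(n):
        prev = pole * prev + rng.gauss(0.0, 1.0)
        out.append(prev)
    return out
```

For the first-order test signal, the first reflection coefficient should
settle near the negative of the pole, under the sign convention used in
this sketch.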
2.3 Linear Predictive Warping

In Appendix 4, we include a detailed description of a
general method for the spectral distortion or warping of speech
signals. The basic idea is to decompose the speech signal, on a
short-time basis, into two components: a spectral envelope and an
excitation signal. The spectrum is then warped in any desired
manner and then recombined with the excitation to form a new
signal with a warped spectrum but with the same pitch and
intonation. The method has many potential applications,
including unscrambling of helium speech, spectral warping for the
hard-of-hearing, and more efficient communications.

The application to efficient communications is in the form
of an LPC vocoder with warping, LPCW. This is described in
detail in Appendix 5. The reasoning for this type of analysis is
as follows. In ordinary linear prediction the speech spectral
envelope is modeled by an all-pole spectrum. The error criterion
employed guarantees a uniform fit across the whole frequency
range. However, we know from speech perception studies that low
frequencies are more important than high frequencies for
perception. Therefore, a minimally redundant model would strive
to achieve a uniform perceptual fit across the spectrum, which
means that it should be able to represent low frequencies more
accurately than high frequencies. In an attempt to achieve such
a uniform perceptual fit, we applied our linear predictive
spectral warping technique to LPC vocoding. The resulting
vocoder, denoted by LPCW, can either improve the vocoded speech
quality for a given bit rate or lower the bit rate for a given
speech quality.

Briefly, at the transmitter of the LPCW vocoder, the
short-time speech spectrum is warped such that high frequencies
are compressed relative to low frequencies, in the sense that
frequency resolution is better at low frequencies than at high
frequencies (but spectral amplitudes are not affected by this
warping); regular LPC analysis is then performed on the warped
spectrum. At the receiver, the all-pole spectrum computed from
the decoded parameters is dewarped using the inverse of the
warping function, and then regular LPC analysis is carried out on
the dewarped spectrum. LPC coefficients resulting from the last
step are in turn employed in synthesizing the speech waveform.
Synthesis experiments performed using the LPCW vocoder indicated
that the introduction of spectral warping produced a saving of
about 11-15% in bit rate without affecting the speech quality.
The indicated saving, however, is achieved at the expense of
increased computation relative to a regular LPC vocoder.
3. PARAMETER QUANTIZATION

The parameters of our LPC vocoder are: log area ratios
(LARs), pitch and gain. We developed improved quantization
schemes for LARs and pitch. As gain parameter, one can transmit
either the energy of the speech signal, or the energy of the
prediction error signal. Through statistical error analysis, we
determined which of these two energies led, in general, to a
smaller quantization error. Details of our work on these issues
are given below.

3.1 Quantization of Log Area Ratios

In our previous work we showed that linear or uniform
quantization of LARs is optimal in the sense of a minimax
spectral error criterion [2]. In deriving this result we used a
prototype spectral sensitivity characteristic of the reflection
coefficients, which was obtained by averaging spectral
sensitivity over a number of speech sounds and over different
reflection coefficients. The resulting quantization scheme had
the same step size for quantizing all the LARs. However, when we
averaged the spectral sensitivity of each reflection coefficient
separately over a number of speech sounds, we found that while
the sensitivity curves of the different reflection coefficients
had the same general U-shape, they were located at different
sensitivity levels. By taking advantage of these differences in
sensitivity levels of the reflection coefficients or equivalently
LARs, we developed an improved quantization scheme that uses
unequal step sizes for the different LARs.

LAR Sensitivity Plots

Employing the experimental procedure that we proposed in our
previous work [2], we computed the spectral sensitivity of each
LAR and averaged it over a number of speech sounds. Our speech
data base consisted of 12 utterances (from 6 males and 6 females)
of a total duration of about 30 sec; speech was low-pass filtered
at 5 kHz and sampled at 10 kHz. A 12th-order linear prediction
analysis was carried out on frames of 20 ms duration of
preemphasized speech; we used the first-order preemphasis filter
$(1 - .969\, z^{-1})$. LPC analysis produces the reflection
coefficients $\{k_i\}$, which are related to the LARs $\{g_i\}$
expressed in decibels by the one-to-one mapping [2]:

$$ g_i = 10 \log_{10} \frac{1 + k_i}{1 - k_i}, \qquad 1 \le i \le p = 12. \tag{3.1} $$
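
The mapping (3.1) and its inverse are easy to state in code. The sketch
below is a direct transcription; the function names are hypothetical.

```python
import math


def reflection_to_lar_db(k):
    """Log area ratio in dB for a reflection coefficient with |k| < 1."""
    if not -1.0 < k < 1.0:
        raise ValueError("reflection coefficient must satisfy |k| < 1")
    return 10.0 * math.log10((1.0 + k) / (1.0 - k))


def lar_db_to_reflection(g):
    """Inverse mapping: reflection coefficient for a LAR given in dB."""
    ratio = 10.0 ** (g / 10.0)      # area ratio (1 + k) / (1 - k)
    return (ratio - 1.0) / (ratio + 1.0)
```

Because the inverse always returns a value strictly inside (-1, 1) for
any finite LAR, quantizing in the LAR domain cannot by itself produce an
unstable synthesis filter.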

We computed the sensitivity of each of the 12 LARs at 13
equally-spaced points over the range -18 to 18 dB, as follows.
The value of, say, the i-th LAR was set equal in turn to one of
those 13 values, while the other 11 LARs were kept constant at
their respective values obtained through LPC analysis for that
frame. $g_i$ was then perturbed by a small amount, and the
corresponding change in the spectrum of the linear predictor and
thus the sensitivity of $g_i$ were measured, as explained in our
paper [2]. The sensitivity measurement procedure was repeated
for each of the other 11 LARs. The 13 sensitivity values of
individual LARs were then averaged separately over 25 voiced
frames and 15 unvoiced frames, selected from our data base.
Figures 3.1 and 3.2 depict averaged spectral sensitivity curves
of individual LARs for respectively voiced and unvoiced speech
sounds. Each figure has 12 sensitivity curves corresponding to
12 LARs and also an average of all the 12 sensitivity curves.
(We have assumed a linear variation in sensitivity between the
computed 13 values.)


Average Sensitivity Levels

In order to derive the step sizes for quantizing the LARs,
first we need to transform, for each LAR, its sensitivity curve
to one number which we shall call its average sensitivity level.
For the i-th LAR $g_i$, it is reasonable to define its average
sensitivity level as

$$ s_i = \sum_{k=1}^{L_i} \left. \frac{\partial S}{\partial g_i} \right|_{g_i = G_{ik}} P_{ik}, \tag{3.2} $$

where the range of $g_i$ is represented by $L_i$ equally spaced
points $G_{ik}$, $1 \le k \le L_i$; $\partial S / \partial g_i$ is
the spectral sensitivity of $g_i$; $P_{ik}$ is the probability of
$g_i$ taking the value $G_{ik}$. It is clear that $s_i$ is
approximately equal to the expected value of
$\partial S / \partial g_i$ if $L_i$ is sufficiently large.
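
The average sensitivity level is a probability-weighted sum of the
sensitivity curve. The sketch below illustrates the computation under
stated assumptions: the curve is known at a few points and linearly
interpolated in between (as the text assumes for the 13 measured values),
and the histogram is given as raw counts that are normalized here. All
names and the example data are hypothetical.

```python
def interpolate_curve(points, values, x):
    """Piecewise-linear interpolation of a curve sampled at `points`."""
    if x <= points[0]:
        return values[0]
    if x >= points[-1]:
        return values[-1]
    for i in range(len(points) - 1):
        x0, x1 = points[i], points[i + 1]
        if x0 <= x <= x1:
            y0, y1 = values[i], values[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("points must be sorted in increasing order")


def average_sensitivity_level(points, values, bins, counts):
    """Probability-weighted mean sensitivity for one LAR.

    points, values: sampled sensitivity curve (e.g. 13 points, 3 dB apart).
    bins:   histogram bin centres of the LAR in dB (e.g. 1 dB apart).
    counts: number of frames observed in each bin.
    """
    total = float(sum(counts))
    return sum(interpolate_curve(points, values, g) * c
               for g, c in zip(bins, counts)) / total
```

Normalizing by the total count makes the result independent of how many
frames went into the histogram.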
Fig. 3.1 Spectral sensitivity curves for LARs of a 12th order
linear predictor, averaged over voiced sounds only. The
top curve corresponds to the first LAR; the bottom curve
to the 12th LAR. Some sensitivity curves cross each
other as shown. The average of the 12 sensitivity curves
is drawn along circled points.

[Horizontal axis: log area ratio (dB)]

Fig. 3.2 Spectral sensitivity curves for LARs of a 12th order
linear predictor, averaged over unvoiced sounds only.
The top curve corresponds to the first LAR; the bottom
curve to the 12th LAR. Some sensitivity curves cross
each other as shown. The average of the 12 sensitivity
curves is drawn along circled points.



In computing the quantities $s_i$ for both voiced and unvoiced
cases, we used the sensitivity data shown in Figs. 3.1 and 3.2
and the probability histogram data for LARs that we previously
collected for Huffman coding purposes [1]. We mention here that
those histograms were computed at 1 dB intervals (or bin size)
from a 100 frames/sec linear prediction analysis of the
preemphasized speech from the above data base. The computed
average sensitivity levels $s_i$ are given in Table 3.1
for both voiced and unvoiced cases. Notice that $s_i$ decreases
almost monotonically with increasing i and that the sensitivity
level of the first LAR is almost twice as much as that of the
11-th LAR. The unequal-step-size quantization method described
below takes advantage of this variation in sensitivity levels in
determining the various LAR step sizes.

Quantization Method

Using the approach of optimal bit allocation strategy that
we presented earlier [2], we computed the number of quantization
levels $N_i$ and the step sizes $\delta_i$ for the different LARs
as follows. The total spectral deviation $\Delta S$ due to LAR
quantization errors $\Delta g_i$, $1 \le i \le p$, where p is the
LPC order, is given approximately by

$$ \Delta S = \sum_{i=1}^{p} s_i \, |\Delta g_i|. \tag{3.3} $$

In an attempt to minimize the maximum spectral deviation, we
replace $|\Delta g_i|$ by its maximum value, which is equal to
half the corresponding step size $\delta_i$ for the linear
quantization of $g_i$ using round-off arithmetic. (If truncation
arithmetic is used, the maximum value will be twice as much, but
the constant scale factor does not change the solution to the
minimization problem given below.) Thus

$$ (\Delta S)_{\max} = \frac{1}{2} \sum_{i=1}^{p} s_i \, \delta_i, \tag{3.4} $$

where

$$ \delta_i = \left[ (g_i)_{\max} - (g_i)_{\min} \right] / N_i, \tag{3.5} $$

and $(g_i)_{\max}$ and $(g_i)_{\min}$ are the upper and lower
bounds on $g_i$. The problem is to minimize $(\Delta S)_{\max}$
with respect to $\{N_i\}$ subject to the constraint that the total
number of bits used for quantizing p LARs be equal to a
prespecified value M:

$$ \sum_{i=1}^{p} \log_2 N_i = M. \tag{3.6} $$

The solution to the above constrained minimization problem is
given below:

$$ N_1 = \left[ \frac{2^M}{\prod_{i=2}^{p} (a_i / a_1)} \right]^{1/p}, \qquad N_i = \frac{a_i}{a_1} \, N_1, \quad 2 \le i \le p, \tag{3.7} $$

where

$$ a_i = \left[ (g_i)_{\max} - (g_i)_{\min} \right] s_i, \qquad 1 \le i \le p. \tag{3.8} $$
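
The allocation of quantization levels can be sketched numerically. The
sketch minimizes half the sum of sensitivity times step size subject to
a fixed total number of bits, which makes each level count proportional
to (range of the LAR) times (its average sensitivity level). The
function name is hypothetical, the result is real-valued, and rounding
to integer level counts is left out.

```python
import math


def allocate_levels(sensitivities, ranges, total_bits):
    """Real-valued numbers of quantization levels, one per LAR.

    sensitivities: average sensitivity level of each LAR.
    ranges:        (upper bound - lower bound) of each LAR in dB.
    total_bits:    bits available for all LARs together.
    """
    a = [s * r for s, r in zip(sensitivities, ranges)]
    p = len(a)
    # Work with logarithms so the product of the a values cannot overflow.
    log_scale = (total_bits * math.log(2.0) - sum(math.log(x) for x in a)) / p
    scale = math.exp(log_scale)
    return [x * scale for x in a]
```

At the solution every LAR contributes the same amount, sensitivity times
step size, to the maximum spectral deviation, so a more sensitive LAR
receives a smaller step size.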

To compare unequal step size quantization with equal step
size quantization, we have listed in Table 3.2 the numbers of
quantization levels for these two methods with the same total
number of bits and considering voiced and unvoiced cases
separately. As expected, relative to the equal step size method,
the unequal step size method places more emphasis on the first
three LARs by allotting more levels to them. Synthesis
experiments showed that use of the unequal step size quantization
method produced better quality speech. The perceived
quantization noise in the synthesized speech was reduced
noticeably when the transmission rate was very low (e.g., 1000
bps).

It should be noted that for real-time implementation, while
the equal step size method requires only one coding table and one
decoding table, the unequal step size method in general requires
p coding tables and p decoding tables.

Table 3.2  Numbers of quantization levels for equal and unequal
           step size quantization of LARs.

            VOICED (43 BITS)           UNVOICED (41 BITS)
Coeff. #    Equal Step   Unequal       Equal Step   Unequal
            (1 dB)       Step          (1 dB)       Step
   1           28          43             29          51
   2           22          31             21          28
   3           19          24             14          18
   4           15          17             13          15
   5           14          15             10          11
   6           13          13              9           9
   7           13          12             12          11
   8           14          12             11          10
   9           12          10             10           9
  10           11           9             10           9
  11            9           7

3.2 Pitch Quantization

Quantization of pitch presents an altogether different
problem from the quantization of other transmission parameters.
The major difference is that the decoded pitch values are
constrained to be integers (samples per pitch period). Another
difficulty arises in attempting to quantize the log pitch in that
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at  the  high  frequency  end  (small  pitch  period)  of  the  range  of 
interest,  the  quantisation  bin  sise,  as  found  by  dividing  the  log 
pitch  scale  into  equal  segments,  can  be  smaller  than  the  distance 
between  t«K>  allowable  pitch  values  (for  decoding).  This  leads  to 
cases  where  two  distinct  quantisation  bins  yield  the  same  decoded 
value,  thus  wasting  some  quantisation  levels.  In  ARPA  M8C  Mote 
#49  [3],  we  proposed  a method  for  deriving  the  pitch  encoding  and 
decoding  tables  in  such  a way  that  maximum  usage  is  made  of  the 
different  quantisation  levels.  Our  simulation  systMi  was 
modified  to  use  this  improved  pitch  quantisation  scheme. 
Considering  pitch  frequencies  over  the  range  5I-45B  Hs  and  using 
6 bits  for  quantisation,  the  improved  coding/decoding  tables  are 
given  in  Table  3.3.  The  quantisation  level  ff  denotes  unvoiced 
frame.  When  the  pitch  period  in  number  of  samples  is  greater 
than  or  equal  to  C(i),  the  i-th  entry  in  the  column  C,  but  less 
than  C(i-fl),  then  it  is  coded  as  level  i and  decoded  as  D(i) 
samples.  For  example,  a pitch  period  of  100  samples  is  coded  as 
level  44  and  decoded  as  101  samples.  A pitch  period  less  than  21 
samples  is  coded  as  level  1,  and  similarly  a pitch  greater  than 
200  samples  is  coded  as  level  63. 
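The table lookup described above can be sketched in Python as follows. The boundary values C(i) and decoded values D(i) are transcribed from Table 3.3; the function names and the use of a binary search are illustrative assumptions, not a description of the original implementation.

```python
from bisect import bisect_right

# Lower bin edges C(1)..C(63), in samples, transcribed from Table 3.3.
C = [21.500, 22.502, 23.502, 24.502, 25.502, 26.502, 27.502, 28.501,
     29.504, 30.493, 31.534, 32.374, 33.996, 35.633, 36.498, 37.375,
     39.027, 40.505, 41.990, 43.517, 44.978, 46.558, 47.819, 50.175,
     51.495, 52.867, 55.043, 56.984, 59.038, 60.879, 63.487, 66.178,
     67.829, 70.532, 73.045, 75.324, 78.674, 80.999, 83.360, 86.575,
     89.368, 92.994, 96.670, 99.363, 102.907, 107.034, 110.998, 115.010,
     118.998, 123.032, 126.903, 131.397, 136.541, 141.490, 146.544,
     151.371, 157.027, 162.546, 167.854, 174.074, 179.904, 186.371,
     193.670]

# Decoded pitch periods D(1)..D(63), in samples, from the same table.
D = [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 40,
     41, 43, 44, 46, 47, 49, 51, 52, 54, 56, 58, 60, 62, 65, 67, 69, 72,
     74, 77, 80, 82, 85, 88, 91, 95, 98, 101, 105, 109, 113, 117,
     121, 125, 129, 134, 139, 144, 149, 154, 160, 165, 171, 177, 183, 190,
     197]

def encode_pitch(period, voiced=True):
    """Return the quantization level (0..63) for a pitch period in samples."""
    if not voiced:
        return 0                        # level 0 denotes an unvoiced frame
    level = bisect_right(C, period)     # number of edges C(i) <= period
    return min(max(level, 1), len(C))   # out-of-range periods map to 1 or 63

def decode_pitch(level):
    """Return the decoded pitch period in samples (None for unvoiced)."""
    return None if level == 0 else D[level - 1]
```

For instance, `encode_pitch(100)` falls in the bin that starts at C(44) = 99.363, which agrees with the worked example in the text.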

Statistics of differences in quantized pitch values using
the above scheme were collected for a number of speech utterances
from male and female speakers for use in Huffman coding of pitch.
Table 3.3  Pitch Coding/Decoding Tables

Level i    C(i)     D(i)      Level i    C(i)     D(i)
   1      21.500     22         18      40.505     41
   2      22.502     23         19      41.990     43
   3      23.502     24         20      43.517     44
   4      24.502     25         21      44.978     46
   5      25.502     26         22      46.558     47
   6      26.502     27         23      47.819     49
   7      27.502     28         24      50.175     51
   8      28.501     29         25      51.495     52
   9      29.504     30         26      52.867     54
  10      30.493     31         27      55.043     56
  11      31.534     32         28      56.984     58
  12      32.374     33         29      59.038     60
  13      33.996     35         30      60.879     62
  14      35.633     36         31      63.487     65
  15      36.498     37         32      66.178     67
  16      37.375     38         33      67.829     69
  17      39.027     40         34      70.532     72
  35      73.045     74         49     118.998    121
  36      75.324     77         50     123.032    125
  37      78.674     80         51     126.903    129
  38      80.999     82         52     131.397    134
  39      83.360     85         53     136.541    139
  40      86.575     88         54     141.490    144
  41      89.368     91         55     146.544    149
  42      92.994     95         56     151.371    154
  43      96.670     98         57     157.027    160
  44      99.363    101         58     162.546    165
  45     102.907    105         59     167.854    171
  46     107.034    109         60     174.074    177
  47     110.998    113         61     179.904    183
  48     115.010    117         62     186.371    190
                                63     193.670    197
                      (upper edge)     200.000
3.3  Gain Quantization

As gain parameter, one can transmit either the energy of the
speech signal, R0, or the energy of the prediction error, Ep.
These two quantities are related to each other by:

$E_p = R_0 V_p$,                                            (3.9)

where Vp denotes the normalized error of the linear predictor.
It can be shown that Ep has a smaller dynamic range and hence
leads to a smaller quantization error than R0. However, when
transmitting Ep, a problem arises from the fact that the
normalized error of the quantized predictor is different from the
unquantized case. This causes an error in the energy of the
synthesized speech even when Ep is not quantized before
transmission. This of course is not the case if we transmit R0.
Another consideration in deciding which transmission parameter to
use for gain is the type of synthesizer implementation. Regular
filter realization (direct form or ladder structure) and
normalized filter realization [4] are the two types used by the
NSC group. The gain of the regular filter is equal to the square
root of Ep, while the gain of the normalized filter is equal to
the square root of R0. Thus, for example, if the receiver
employs the normalized filter, it is better to transmit R0 since
transmitting Ep in this case requires computing the normalized
error of the synthesizer filter and dividing the received Ep by
it to obtain the normalized filter gain. Avoiding these extra
operations may be desirable, particularly for real-time
implementation.

We conducted a statistical error analysis using both R0 and
Ep for transmission [5]. Our findings indicated that, in
general, it is better to use R0 for transmission than to use Ep.
Such a choice is more strongly recommended when using the
normalized filter. The results of this study also suggested a
third alternative, which is to transmit the product of R0 and the
normalized error of the quantized predictor. This alternative
seems attractive for the case when the regular filter realization
is used.
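The energy mismatch discussed in this section can be illustrated with a short Python sketch. It relies on the standard identity that the normalized error of an all-pole predictor is the product of (1 - k_i^2) over its reflection coefficients, together with relation (3.9). The reflection coefficient values and all variable names below are hypothetical and serve only as an example.

```python
import math

def normalized_error(reflection_coeffs):
    """Normalized prediction error Vp = product of (1 - k_i^2)."""
    v = 1.0
    for k in reflection_coeffs:
        v *= 1.0 - k * k
    return v

r0 = 1000.0                        # speech signal energy (arbitrary units)
k_exact = [0.90, -0.50, 0.20]      # hypothetical unquantized coefficients
k_quant = [0.88, -0.52, 0.20]      # hypothetical quantized coefficients

# Transmitting Ep (unquantized) while the synthesizer uses the quantized
# predictor: the synthesized energy is Ep / Vp(quantized), not R0.
ep = r0 * normalized_error(k_exact)
r0_at_receiver = ep / normalized_error(k_quant)
energy_error_db = 10.0 * math.log10(r0_at_receiver / r0)

# Third alternative: transmit R0 times the normalized error of the
# quantized predictor, which restores R0 at the synthesizer output.
third_alternative = r0 * normalized_error(k_quant)
```

In this example the energy error is a fraction of a decibel; its size depends entirely on how far the quantized coefficients move the normalized error.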
4.  VARIABLE FRAME RATE TRANSMISSION

4.1  Review of our Past Work

In our previous work on developing minimally redundant
narrowband speech transmission systems, we have used quite
successfully the concept of variable frame rate (VFR)
transmission [1]. In a VFR scheme, model parameters (LPC
parameters, log pitch, log gain) are transmitted only when the
properties of the speech signal have changed sufficiently since
the preceding transmission; the parameters for the untransmitted
frames are regenerated at the receiver through linear
interpolation between the parameters of the two adjacent
transmitted frames. For example, speech parameters may be
transmitted less often during steady-state portions of speech,
and more often during rapid speech transitions.
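The receiver-side regeneration of untransmitted frames can be sketched as below. This is a minimal Python illustration of linear interpolation between two transmitted frames; the function name and the dictionary return type are assumptions made for the example.

```python
def interpolate_frames(n0, v0, n1, v1):
    """Regenerate parameter vectors for frames n0+1 .. n1-1.

    v0 and v1 are the parameter vectors received for frames n0 and n1.
    Returns a dict mapping frame number to interpolated vector.
    """
    span = n1 - n0
    regenerated = {}
    for n in range(n0 + 1, n1):
        t = (n - n0) / span          # fraction of the way from n0 to n1
        regenerated[n] = [a + t * (b - a) for a, b in zip(v0, v1)]
    return regenerated

# Frames 10 and 14 were transmitted; frames 11-13 are regenerated.
filled = interpolate_frames(10, [0.0, 4.0], 14, [4.0, 0.0])
```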

Below, we briefly review the particular VFR transmission
scheme that we employed in our past work. Linear predictive
analysis was performed once every 10 ms on speech, low-pass
filtered at 5 kHz and sampled at 10 kHz, to extract 100
frames/sec (fps) of LPC data: pitch, gain and 11 log area ratios
(LARs). Pitch and gain were transmitted at the full 100 fps
rate, while LARs were transmitted at a variable rate using the
following VFR scheme. The transmission scheme computed the
distance or the amount of deviation between the LARs of the
current frame and the LARs of the last transmitted frame, and
compared this distance against a preselected threshold. The LARs
of the current frame were transmitted only when the above
distance exceeded the threshold. To compute the above distance,
we used the so-called log likelihood ratio measure, which is the
logarithm of the ratio of the mean-squared values of the linear
prediction error signal obtained for the current frame (i) when
the optimal linear predictor parameters (i.e., the LARs extracted
for the current frame) are used and (ii) when the last
transmitted parameters are used.
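A minimal Python sketch of such a log likelihood ratio follows. It assumes that each predictor is given as inverse-filter coefficients [1, a1, ..., ap], that the mean-squared prediction error is evaluated from the autocorrelation of the current frame, and that the ratio is taken as error (ii) over error (i) so that the measure is nonnegative. The function names and these conventions are assumptions for illustration.

```python
import math

def prediction_error(a, r):
    """Mean-squared error of inverse filter a = [1, a1, ..., ap].

    r[0..p] is the autocorrelation of the current frame; the result is
    the quadratic form sum_i sum_j a[i] * a[j] * r[|i - j|].
    """
    n = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)]
               for i in range(n) for j in range(n))

def log_likelihood_ratio(a_opt, a_last, r):
    """Log of (error with last transmitted predictor / optimal error)."""
    return math.log(prediction_error(a_last, r) / prediction_error(a_opt, r))

# First-order example: r = [1, 0.5] gives the optimal filter [1, -0.5].
r = [1.0, 0.5]
d = log_likelihood_ratio([1.0, -0.5], [1.0, 0.0], r)
```

A frame would then be transmitted whenever this distance exceeds the preselected threshold.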

During the first year of this contract, we investigated
several modifications to the above VFR scheme [6]. An important
result of this work is the double-threshold scheme, which
compared the log likelihood ratio between a current frame and the
previously transmitted frame against two thresholds LRT1 and
LRT2, where LRT2>LRT1. If the log likelihood ratio was less than
LRT1, the current frame LARs were not transmitted; if it exceeded
LRT1, but not LRT2, then the current frame LARs were transmitted;
if it exceeded both thresholds, then the LARs of the frame
immediately preceding the current frame were transmitted. The
purpose of the last step was to avoid having to do parameter
interpolation at the receiver between largely different data
frames.
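The double-threshold decision can be sketched in Python as follows. The sketch assumes that frame 0 is always transmitted and that, after the preceding frame has been sent, the current frame is tested again against it; neither detail is stated in the text, and the function and argument names are hypothetical.

```python
def double_threshold_vfr(frames, distance, lrt1, lrt2):
    """Return the indices of the frames selected for transmission.

    distance(x, y) is any frame distance (for example a log likelihood
    ratio); lrt1 < lrt2 are the two thresholds.
    """
    sent = [0]                 # assumption: the first frame is always sent
    last = 0
    n = 1
    while n < len(frames):
        d = distance(frames[n], frames[last])
        if d > lrt2 and n - 1 > last:
            # Large change: send the frame just before the current one,
            # then re-examine the current frame against it.
            last = n - 1
            sent.append(last)
            continue
        if d > lrt1:
            last = n
            sent.append(last)
        n += 1
    return sent

# Scalar "frames" with an absolute-difference distance, for illustration.
selected = double_threshold_vfr([0.0, 0.1, 0.2, 5.0, 5.1, 5.6],
                                lambda x, y: abs(x - y), 0.4, 2.0)
```

In the example, the jump at frame 3 causes frame 2 to be sent first, so the receiver never interpolates across the jump from frame 0.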


BBN  Report  No.  3794 


Bolt  Beranek  and  Newman  Inc 


The above VFR scheme is being used in the real-time ARPA-LPC
System II, whose specifications we provided in the form of an NSC
note [7]. This note provides a step-by-step description for both
the single-threshold and the double-threshold VFR schemes.

Employing  the  above  VFR  scheme,  we  reduced  the  LAR 
transmission  rate  from  100  fps  to  an  average  of  about  37  fps, 
with  only  a small  change  in  the  quality  of  the  resynthesized 
speech  relative  to  the  case  when  all  the  available  100  fps  data 
were  transmitted.  Further,  we  observed  that  any  significant 
reduction  in  the  frame  rate  below  37  fps  introduced,  in  general, 
noticeable  distortions  in  the  speech  quality. 

In  an  effort  to  further  reduce  the  average  frame  rate  of  LAR 
transmission,  without  speech  quality  degradation,  we  developed  a 
new  VFR  scheme  based  on  a functional  perceptual  model  of  speech. 
The  model  and  the  new  scheme  are  described  in  the  next 
subsection. 

4.2  Perceptual-Model-Based VFR Scheme

A detailed  description  of  our  perceptual  model  of  speech  and 
manual  and  automatic  VFR  schemes  based  on  this  model  is  contained 
in  a paper  which  is  reproduced  here  as  Appendix  6.  Below,  we 
briefly  review  the  model  and  give  the  details  of  an  improved 
automatic  VFR  scheme  that  we  developed  since  the  time  the  above 
paper  was  written. 
4.2.1  A Perceptual Model of Speech

With the motivation of developing an efficient VFR
transmission scheme, we formulated the following perceptual model
of speech:

1)  Speech can be represented in terms of LPC (or other)
parameters extracted at a minimal set of perceptually significant
time points (or frames), not necessarily equally spaced.

2)  Between any two such time points, the parameters vary
linearly.

3)  The location of these points is obtained independently
for pitch, gain, and spectral (or LPC) parameters.

Our requirement is that the quality of the resynthesized speech
based on this model should be no worse than that of the unreduced
or the full 100 fps case. We experimentally demonstrated the
validity of the above model by using a manual, trial-and-error
scheme, and we achieved a lower limit for the LAR transmission
frame rate of about 2 transmissions per phoneme, or about 24 fps.
We then developed a fully automatic two-stage scheme which
approximately met the model requirements as well as achieved this
lower limit of 24 fps (for LAR transmission). Details on the
manual and automatic schemes are given in Appendix 6.




A major difference between the perceptual-model-based VFR
scheme and our earlier VFR scheme that is being used in the ARPA
LPC-II system is in the transmission strategy: our earlier
scheme performs an "end-to-end comparison," illustrated in
Fig. 4.1a, between the preceding transmitted frame and the
current frame being considered for transmission; on the other
hand, the new scheme, as shown in Fig. 4.1b, compares LPC
parameters of every frame in the transmission interval with those
obtained by linear interpolation between the two "end-frames" and
computes the total transmission error as some weighted average of
the individual frame errors. It is this difference which has led
to a substantially lower transmission frame rate for the new
scheme than for our earlier scheme.

Below, we report on several modifications that we made on
the two-stage VFR scheme for the transmission of LARs.

4.2.2  Transmission Error Computation

Given that LARs of the frame N, say, have been transmitted,
the basic strategy is to determine the longest line extending
from g(N) (vector of p LARs for frame N) in the p-dimensional
parameter space such that the resulting transmission error
computed between the actual parameter vectors g(N+i) and the
interpolated parameter vectors ĝ(N+i) over the duration of that
line is less than some threshold (see Fig. 4.1b). First, we need
to define frame error, or distance between two sets of LARs g and
ĝ for any given frame, and then specify how this error is
averaged over several frames (time averaging).

Frame Error

The frame error for frame n, denoted by E(n), is defined as
the weighted Euclidean distance:

$E(n) = \left[ \sum_{i=1}^{m} w_i\, [\hat{g}_i(n) - g_i(n)]^2 \right]^{1/2}$,      (4.1)

where {w_i} is the set of coefficient weights chosen to reflect the
relative importance of the different LARs (presumably to
perceived speech quality); we allow m ≤ p.

We have chosen the coefficient weights to be the expected or
average spectral sensitivities of individual LARs (see Table
3.1). For the first 4 LARs, these are: 1.3, 1.2, 1.1 and 1.0.
This weighting scheme is based on the reasonable idea that a
given amount of error in a LAR with a higher sensitivity is more
important to spectral accuracy (and hence perception) than the
same error in a LAR with a lower sensitivity. Surprisingly,
however, our experimental results showed that different choices
of these weights (e.g., all equal to 1) produced no
discernible change in speech quality.
We found through experimentation that the summation in the
frame error definition (4.1) need be done only up to the first 4
LARs (i.e., m=4).

Another way to compute the frame error is via the log likelihood
ratio measure explained above. Our experiments (see Subsection
4.2.7) indicated identical speech quality results for the same
average transmission frame rate, for the two measures: LAR
distance and log likelihood ratio. Since LARs are being used as
transmission parameters, use of the LAR distance measure is
computationally much less expensive than the log likelihood ratio
measure. So, we employed the LAR distance in all our subsequent
experiments.
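The LAR distance used as frame error can be sketched in a few lines of Python. The sketch assumes the square-root (Euclidean) form of the weighted distance and uses the four weights quoted in the text, so that only the first four LARs contribute; the function name is hypothetical.

```python
import math

# Weights for the first 4 LARs, as quoted in the text (m = 4).
LAR_WEIGHTS = (1.3, 1.2, 1.1, 1.0)

def frame_error(g_hat, g, weights=LAR_WEIGHTS):
    """Weighted Euclidean distance between two LAR vectors.

    Only the first len(weights) LARs are compared, because zip stops
    at the shortest of its arguments.
    """
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, g_hat, g)))
```

Because the distance works directly on the transmission parameters, it needs no conversion to predictor coefficients, which is the source of the computational saving mentioned above.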

Transmission Error

The transmission error ET between frames N and N+M is
computed as the weighted, time-averaged frame error:

$\mathrm{ET} = \frac{1}{M} \sum_{n=N+1}^{N+M} W(n)\, E(n)$,                  (4.2)

where W(n) is the frame weight for frame n. (The upper limit for
the summation in (4.2) is considered as N+M to incorporate the
effect of LAR quantization; E(N+M) is computed from (4.1) with ĝ_i
denoting quantized LAR values.) As frame weight, we successfully
used the speech signal energy per sample in that frame, R0,
expressed in decibels and normalized with respect to some
estimate RM of the maximum value of R0:

$W(n) = R_0(n) / R_M(n)$.                                   (4.3)

The idea behind the weighting scheme given by (4.3) is that even
large frame errors do not make a perceptible effect if they are
associated with relatively small speech signal energies. For our
speech data base, where we have 9-bit samples, R0 is usually
around 35-40 dB for open vowels, 15-30 dB for fricatives, and
around 0-7 dB for the silent period of an unvoiced plosive.
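A minimal Python sketch of the time-averaged transmission error follows. It assumes that the average is taken over the M frames N+1 through N+M with a 1/M normalization, and that the frame errors and frame weights have already been computed; the function name is hypothetical.

```python
def transmission_error(frame_errors, frame_weights):
    """Weighted time average of frame errors over one interval.

    frame_errors[j] and frame_weights[j] belong to frame N+1+j,
    for j = 0 .. M-1.
    """
    m = len(frame_errors)
    return sum(w * e for w, e in zip(frame_weights, frame_errors)) / m

# Three frames; the last one has zero weight (for example, silence).
et = transmission_error([1.0, 2.0, 3.0], [1.0, 0.5, 0.0])
```

The example shows the intended effect of the weighting: the largest frame error contributes nothing because it falls in a frame of negligible energy.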

A simple and efficient way to update RM is by the following
recursive method:

$R_M(n) = \max\{ R_0(n),\ \alpha R_M(n-1),\ 25\ \mathrm{dB} \}$,            (4.4)

where α is a constant less than 1. We use α = 0.98, which
means that RM decays to half its original value in about 27
frames. It should be noted from (4.3) and (4.4) that W(n)=1 if
R0(n)>25 and has been increasing or has been decreasing slowly;
W(n)<1 if R0(n)<25 or if R0(t) has been decreasing at a faster
rate than exp(-0.98t).
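The recursive update of RM and the resulting frame weights can be sketched in Python as below. The sketch assumes that RM starts at the 25 dB floor before the first frame and treats the decay constant as a parameter; the default value and the function name are assumptions for illustration.

```python
def frame_weights(r0_db, alpha=0.98, floor_db=25.0):
    """Return W(n) = R0(n) / RM(n) for a sequence of frame energies in dB.

    RM(n) = max(R0(n), alpha * RM(n-1), floor_db).
    """
    weights = []
    rm = floor_db                   # assumed starting value of RM
    for r0 in r0_db:
        rm = max(r0, alpha * rm, floor_db)
        weights.append(r0 / rm)
    return weights

# A vowel (40 dB), a slight decrease, then an abrupt drop to 10 dB.
w = frame_weights([40.0, 39.5, 10.0])
```

The slow decrease keeps the weight at 1, while the abrupt drop gives a weight well below 1, as the text describes.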

4.2.3  Parameter Quantization

There are two ways in which the effect of parameter
quantization can be included within the above procedure for
transmission error computation. Both ways can be employed
simultaneously.

The second way of incorporating parameter quantization is
what we call the "adjustable" quantization method. A parameter
value is normally quantized to its nearest quantization level.
The adjustable quantization scheme allows either of the two
nearest quantization levels. Thus, given the quantized LARs of
the initial frame (left end-frame), the scheme determines the
adjusted quantized values of the LARs for the final frame (right
end-frame) in the transmission interval, in such a way that the
total transmission error is minimized.

A one-dimensional (p=m=1) example is shown in Fig. 4.2 to
illustrate the "adjustable" quantization. For this example, the
parameter value of the sixth frame is selected for transmission.
If this value is quantized to the nearest quantizer output (the
output just below it), there is considerable interpolation error
in the interval between frames 1 and 6. If the higher quantizer
output is used instead, the total transmission error is reduced.
(Fig. 4.2 also shows the interpolation line for the next
transmission interval from frame 6 to frame 11.)

Fig. 4.2  Example to illustrate the "adjustable" quantization
scheme. Dashed-line plot corresponds to normal
quantization, where a parameter value is quantized
to the nearest quantizer output. Solid line corresponds
to the "adjustable" quantization (see text).
(The dots represent the original unquantized parameter
values.)
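The one-dimensional case of Fig. 4.2 can be sketched in Python as follows. The sketch assumes a uniform quantizer with a given step, unit frame weights, and absolute error as the frame error; none of these details is specified in the text, and the function name is hypothetical.

```python
import math

def adjustable_quantize(track, q_start, step=1.0):
    """Choose the quantizer output for the right end-frame.

    track holds the original values for frames N .. N+M (track[0] is
    frame N); q_start is the quantized value already used for frame N.
    The two quantizer outputs that bracket track[-1] are tried, and the
    one giving the smaller total interpolation error is returned.
    """
    m = len(track) - 1
    lower = math.floor(track[-1] / step) * step
    candidates = (lower, lower + step)

    def total_error(q_end):
        return sum(abs(q_start + (q_end - q_start) * i / m - track[i])
                   for i in range(1, m + 1))

    return min(candidates, key=total_error)

# The end value 2.4 is nearest to level 2, but level 3 fits the track.
q_end = adjustable_quantize([0.0, 0.6, 1.2, 1.8, 2.4, 2.4], q_start=0.0)
```

Here normal quantization would pick level 2, whereas the adjusted choice of level 3 makes the interpolation line pass through the four intermediate frames.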


4.2.4  "Look-Ahead" Procedure

Sometimes the transmission error may temporarily exceed the
prespecified threshold. However, if the transmission interval is
lengthened, the error may drop below the threshold. An example
is illustrated in Fig. 4.3. In Fig. 4.3a, the first and third
frame values are considered to be transmitted; in Fig. 4.3b, the
first and the fifth frame values are shown as being transmitted.
The transmission error for the case in Fig. 4.3b is seen to be
lower than for the case in Fig. 4.3a.

We call the above feature a "look-ahead" feature. The
extent of "look-ahead" (in terms of number of frames to consider)
is limited only by the resulting computational burden; we use a
four-frame "look-ahead" procedure. If the error does not drop
below the threshold even after moving forward by four frames, we
hypothesize the transmission of the frame immediately preceding
the one where the threshold was first exceeded.
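The "look-ahead" search can be sketched in Python for a single parameter track. The sketch assumes unquantized end-frames, unit frame weights, absolute error as the frame error, and a maximum interval of 12 frames (the normal value of MAXDEL in Table 4.1); the function name and these simplifications are assumptions for illustration.

```python
def next_transmission_frame(track, start, threshold,
                            lookahead=4, max_interval=12):
    """Return the index of the next frame to transmit after `start`.

    `start` must not be the last frame of the track.
    """
    def interval_error(end):
        m = end - start
        slope = (track[end] - track[start]) / m
        return sum(abs(track[start] + slope * (n - start) - track[n])
                   for n in range(start + 1, end + 1)) / m

    last_good = start + 1        # an adjacent frame has zero error
    first_bad = None             # frame where the threshold was exceeded
    end = start + 1
    while end < len(track) and end - start <= max_interval:
        if interval_error(end) < threshold:
            last_good, first_bad = end, None
        else:
            if first_bad is None:
                first_bad = end
            if end - first_bad >= lookahead:
                break            # still too large after looking ahead
        end += 1
    return last_good

# The error exceeds the threshold at frame 3 only temporarily.
track = [0.0, 1.0, 2.0, 4.0, 4.0, 5.0, 6.0]
chosen = next_transmission_frame(track, 0, 0.3)
```

With `lookahead=0` the same search stops at frame 2, which is the behaviour of a scheme without the "look-ahead" feature.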
Fig. 4.3  Example to illustrate the "look-ahead" procedure. The
dots represent the original unquantized parameter
values. The x's represent quantizer output values.
The vertical dashed lines indicate frames chosen for
transmission.

(a)  Without the "look-ahead" scheme

(b)  With the "look-ahead" scheme
4.2.5  "Back-Up" Procedure

Once we have determined three successive transmission frames
which will keep the transmission errors in the two transmission
intervals below the threshold, we then reposition the middle
transmission frame by backing up, in order to minimize the total
transmission error in both intervals. (When using the above
"adjustable" quantization with the "back-up" procedure, we
compute the total error in the two transmission intervals by
first computing the "adjusted" quantized values for the second
and third transmission frames. This is illustrated in Fig. 4.2,
where the three successive transmission frames considered are
frames 1, 6 and 11.)

Fig. 4.4 illustrates the "back-up" procedure by way of an
example. The VFR scheme initially decided to transmit frames 3,
8 and 13, as shown in Fig. 4.4a. The two interpolation lines are
also shown. Fig. 4.4b clearly demonstrates that if frame 7 were
transmitted instead of frame 8, the interpolated values would
match the original data much more closely in both transmission
intervals.
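The repositioning of the middle frame can be sketched in Python for a single parameter track. The sketch assumes unquantized end-frames, unit frame weights, absolute error as the frame error, and a search over every frame between the first and the middle transmission frames; how far the original scheme backs up is not stated in the text, and the function names are hypothetical.

```python
def interp_error(track, a, b):
    """Summed absolute error of the line drawn from frame a to frame b."""
    slope = (track[b] - track[a]) / (b - a)
    return sum(abs(track[a] + slope * (n - a) - track[n])
               for n in range(a + 1, b))

def back_up(track, first, middle, third):
    """Return the middle frame position with the smallest total error."""
    candidates = range(first + 1, middle + 1)
    return min(candidates,
               key=lambda t: interp_error(track, first, t)
                             + interp_error(track, t, third))

# Data that rise until frame 7 and fall afterwards (frames 0..13).
track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
         6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
middle = back_up(track, 3, 8, 13)
```

As in the example of Fig. 4.4, the initial choice of frame 8 is moved back to frame 7, where the two interpolation lines meet the data.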

Fig. 4.4  Example to illustrate the "back-up" procedure. The
dots represent the original unquantized parameter
values. The vertical dashed lines indicate frames
chosen for transmission.

(a)  Without the "back-up" scheme

(b)  With the "back-up" scheme

4.2.6  Flow Chart of the VFR Scheme

A flow chart of the VFR scheme described in the previous
subsections is given in Fig. 4.5. Variables that appear in the
flow chart are defined in Table 4.1. A function called ERROR is
used to compute the transmission error between two hypothesized
transmission frames. It accepts as input the quantization levels
for the LARs at the first or initial frame, and determines the
"adjusted" set of quantization levels for the second transmission
frame. If the function is called with three transmission frames,
it provides the optimal set of quantization levels for the second
and third transmission frames. Each box shown in the flow chart
translates into one or two FORTRAN statements.

Table 4.1

List of Variables Used in the Flow Chart in Fig. 4.5.

LFRSNT  Last frame actually transmitted.

T0      First frame (left end-frame of the interpolation line) in
        a transmission interval. T0 equals either LFRSNT or a
        hypothesized transmission frame when using the "back-up"
        scheme.

TN      Current frame (right end-frame of the interpolation line).

ET      Transmission error between the original unquantized LAR
        data, and the quantized and interpolated values, computed
        over the interval from frame T0+1 to frame TN (see eq. (4.2)
        in the text).

θ       Transmission error threshold. Normally, θ = 1.3.

LGDFRM  Last good frame, i.e., last frame where ET < θ.

LKAHED  Number of frames to "look ahead" beyond the frame where
        ET exceeds θ. Normally, LKAHED = 4 frames.

MAXDEL  Maximum allowed transmission delay. Without the "back-up"
        scheme, it is the maximum transmission interval permitted.
        With the "back-up" scheme, it is the maximum allowed
        interval between a transmitted frame (LFRSNT) and a frame
        which is the second hypothesized transmission frame after
        LFRSNT if it is not LGDFRM, or the second hypothesized
        transmission frame plus LKAHED, otherwise. Normally,
        MAXDEL = 12 frames (120 ms).

QOPT    Quantized LAR values resulting from the "adjustable"
        quantization scheme.

T*      Frame position, determined by the "back-up" procedure.

4.2.7  Experimental Results

We tested the VFR algorithm described above on a set of nine
sentences (JB1, DD2, RS3, AR4, DK4, JB5, RS6, DK6 and DD6; 6
sentences from 3 males and 3 sentences from 2 females) from the
data base used in our speech quality evaluation work (see Section
7.2.1). Table 4.2 describes six vocoder systems and lists their
average transmission frame rates and bit rates obtained over the
nine sentences. We ran informal, pair-wise speech quality
comparison tests on the syntheses from these six vocoder systems,
to evaluate the relative performance of the different versions of
the above VFR scheme.

Vocoders 1 and 2 given in Table 4.2 employed the full 100
fps fixed-rate transmission for all parameters (pitch, gain and
LARs). Vocoder 1 used the unquantized parameters for synthesis,
while Vocoder 2 quantized the parameters using 5 bits for gain, 6
bits for pitch (plus 1 bit for Voiced/Unvoiced status), and 44
bits for LARs of voiced frames and 42 bits for LARs of unvoiced
frames, which resulted in a transmission bit rate of about 5650
bps. Vocoders 3-6 quantized the parameters in the same way, but
employed VFR transmission for all parameters. For pitch and gain
VFR transmission, they all used the double-threshold PIT scheme
on the quantized values (levels) (see Section 4.3.2), with
thresholds of 0 and 1 for pitch, and 1 and 2 for gain; this
yielded a transmission frame rate of about 28 fps for pitch and
24 fps for gain. The VFR scheme used for LAR transmission became
progressively more complex going from Vocoder 3 to Vocoder 6,
with Vocoder 6 employing the complete VFR scheme described in the
last subsection via flow chart. The simplest VFR scheme (used by
Vocoder 3) employs the quantized LARs of the end-frames of the
interpolation line (see Subsection 4.2.3). For all the four
vocoders, the threshold θ (see flow chart in Fig. 4.5) for the
transmission error ET was chosen as 1.3. (We chose m=4 in the
expression (4.1) for frame error since it yielded the same speech
quality as any higher value but at the least computational
effort.) The above choice of the transmission error threshold
produced an average frame rate of about 25 fps for the full
scheme (Vocoder 6) and an average transmission error (ET averaged
over the nine sentences) of 0.55.
Informal tests of pair-wise speech quality comparisons were
run for the six vocoders. Also, we compared the full VFR scheme
(Vocoder 6) with our earlier "end-to-end" scheme used in LPC-II
and with the 50 fps fixed-rate scheme used in LPC-I. (The latter
two vocoders we considered were not LPC-II and LPC-I themselves,
in view of the differences in vocoder conditions such as speech
signal sampling rate, bit allocation for parameter quantization,
and pitch extraction scheme.) Below, we describe the results of
only the important comparisons. (Speech quality tests comparing
Vocoders 3-5 with Vocoder 6 are given in Subsection 4.2.8.)

1.  Vocoder 2 vs Vocoder 6. There were cases for which speech
transitions were more "crisp" for Vocoder 2 (5650 bps) than
for Vocoder 6 (1650 bps). However, for most sentences
(especially the slowly varying ones, JB1 and DD2), the
synthesized speech from Vocoder 2 sounded worse in that it
had appreciably more "wobble" quality than the synthesis
from Vocoder 6. Our explanation for the observed quality
difference is that for the cases when the "wobble" quality
is perceived, the error due to parameter quantization is
more than the error due to parameter interpolation.

2.  Same comparison as in (1), except that both systems used
unquantized parameters in the synthesis (i.e., Vocoder 1 vs
unquantized version of Vocoder 6). The syntheses for the
slowly varying sentences JB1 and DD2 from the variable rate
system had slightly less "wobble" quality than those from
the fixed rate system. This is probably due to the fact
that small inaccuracies in the LPC analysis arising from
interaction between the pitch period and the analysis
interval tend to be reduced by the smoothing effect of the
interpolation employed by the VFR scheme. There were a
couple of situations (during the part "trouble with" in the
sentence DK6) where the fixed rate synthesis sounded better.
In general, Vocoder 1 and the unquantized version of
Vocoder 6 produced speech with essentially the same quality.

3.  Vocoder 1 vs Vocoder 6. Surprisingly, the results of this
comparison between the unquantized 100 fps system and the
1650 bps VFR system were the same as given above in (2).

4.  Vocoder 6 (1650 bps) produced speech quality equal to or
better than that of the VFR system with the earlier
"end-to-end" scheme of LPC-II (2100 bps). Speech quality
improvements observed in the syntheses from Vocoder 6
included clarity and "crispness" of several syllables which
were slurred when processed through the earlier VFR system.

5.  Vocoder 6 (1650 bps) was compared against the 50 fps
fixed-rate system (2825 bps). LPC-I also uses the 50 fps
fixed-rate transmission but operates at an even higher bit
rate of about 3500 bps. Although the 50 fps system had less
"wobble" quality than the 100 fps system (Vocoder 2), it
still had more "wobble" quality than Vocoder 6, especially
for the sentences JB1 and DD2.

6.  Finally, we employed the log likelihood ratio measure for
computing the frame error between the two sets of LARs g and
ĝ, instead of the weighted Euclidean distance measure given
by (4.1). (Notice that for likelihood ratio computation,
LARs are to be first transformed to predictor coefficients.)
We adjusted the transmission error threshold (θ) so as to
obtain about the same average frame rate (25 fps) as Vocoder
6. We found that the speech quality of the resulting
vocoder was identical to that of Vocoder 6. This result
leads to the following two observations. First, we conclude
that the superior performance of the new
perceptual-model-based VFR scheme over the earlier
"end-to-end" scheme of LPC-II (see (4) above) is not due
to the change in the definition of the frame error, but due
to the difference in the way the transmission error is
computed in each case (see Fig. 4.1 which illustrates this
difference). Secondly, we recommend the use of the LAR
distance measure (4.1) in preference to the log likelihood
ratio measure, since the use of the latter measure requires
about 50 times more computational time.


4.2.8 Simplified VFR Scheme

Though the algorithm described above produced very low frame
rates and good quality speech, it has the disadvantage of being
fairly complex, and somewhat slower than real time in our
simulation on a KL-10 computer. Of course it could be coded to
run in real time on a fast minicomputer, but might not leave
enough time for other processing needs. Therefore, we tried
several simplifications of the algorithm, in order to arrive at a
reasonable compromise between speed, complexity, frame rate (and
bit rate) and speech quality.

Our first simplification (see Table 4.2, Vocoder 5) involved
the adjustable quantization. Instead of allowing two possible
quantization levels for each LAR of every hypothesized
transmission frame, the LAR values were always quantized to the
nearest levels. This sped up the algorithm by a factor of 4, and
reduced the complexity. The transmission frame rate (for the
same transmission error threshold) rose to about 27 fps. However,
the resulting sentences were indistinguishable from those
produced by the scheme with adjustable quantization.

For the second simplification we eliminated the "back-up"
procedure (Vocoder 4). The frame rate remained unchanged at 27
fps, but the average measured transmission error increased by
about 20%. Careful, repeated listening through headphones
revealed only a slight degradation for two sentences. The
differences were not perceived through high quality loudspeakers,
and were not noticed on single paired-comparisons through
headphones. This simplification sped up the algorithm by another
factor of 3, and reduced the complexity considerably.

The third simplification was the removal of the "look-ahead"
procedure (Vocoder 3). That is, as soon as the transmission
error computed over the interval from the preceding transmitted
frame to the current frame exceeded the threshold, the frame
immediately preceding the current one was chosen to be
transmitted. As expected, this increased the frame rate
substantially (to 30 fps), for the same speech quality. When the
"look-ahead" procedure enabled the algorithm to skip over a bad
region, the transmission intervals were greatly lengthened. The
simplification reduced processing time by about 30%, and
eliminated only 3 lines of FORTRAN code.


Recommended  Scheme 




While the full scheme (Vocoder 6) clearly results in a lower
frame rate and slightly better speech quality, it is much more
complex and an order of magnitude slower than the simplest scheme
(without "adjustable" quantization, and "back-up" and
"look-ahead" features). The first two simplifications discussed
above seem reasonable, since the resulting loss was small. The
last feature ("look-ahead") is recommended, since its removal
resulted in substantial losses and produced only minor gains.

Fig. 4.6 shows a flow chart of the recommended VFR scheme
(Vocoder 4). Comparison of this simplified scheme with Fig. 4.5
will make the difference in complexity apparent.

Of course, if the computer running the VFR algorithm is fast
enough and easy to program, it may be worth the extra trouble to
implement the full scheme, which includes the features of
"adjustable" quantization and "back-up".

4.3  Transmission  of  Pitch  and  Gain 

We have developed two types of VFR schemes for the
transmission of pitch and gain. These are: (1)
"Floating-Aperture Predictor," which performs an "end-to-end"
comparison between the parameter values of the current frame and
the last transmitted frame, and (2) "Fan Interpolation
Technique", which explicitly takes advantage of the fact that the
receiver performs linear interpolation for the reconstruction of
untransmitted data. The results of our investigation on these
two types of schemes are given below.

4.3.1 Floating Aperture Predictor (FAP)

VFR transmission schemes of the FAP type have been described
in detail in our NSC Note No. 96 [8]. We developed both
single-threshold and double-threshold VFR schemes for the
transmission of pitch and gain. (As LPC gain parameter, we
transmit per-sample energy in decibels of the unpreemphasized
speech.) The single-threshold scheme transmits the parameter
value (pitch or gain) for a given frame if the absolute
difference between the value and the preceding transmitted value
exceeds a prespecified threshold. The double-threshold scheme
follows the same rule, except that it instead transmits the
parameter value for the frame immediately preceding the present
frame if the above absolute difference exceeds a prespecified
second (higher) threshold; as in LAR transmission above, this
avoids the need to do parameter interpolation at the receiver
between widely differing data frames. In [8] we have
recommended the use of specific double-threshold VFR schemes on
quantized pitch and gain data for ARPA-LPC System II. These
schemes would reduce the average transmission frame rate from the
analysis rate of 100 fps to about 35 fps for pitch and 32 fps for
gain.

Fig. 4.6 Flow chart of recommended, simplified algorithm.

The above-mentioned single-threshold scheme is similar to
the so-called "floating-aperture predictor" which has been used
for data compression in telemetry applications [9,10]. The main
difference between the two is in the way data reconstruction
takes place at the receiver, i.e., how the untransmitted parameter
values are approximated. The traditional FAP method employs a
stair-step reconstruction in that a transmitted value is held
constant for all the frames up to the next transmission, where
the value is instantaneously updated to be the next-transmitted
value. Our single-threshold scheme, however, performs linear
interpolation between adjacent transmitted values to generate a
smoother approximation. (The double-threshold scheme has the
same feature, except that, as mentioned above, it produces less
interpolation error at the expense of a slight increase in frame
rate.) It is felt that in speech resynthesis applications the
smooth approximation produced by interpolation should produce
less speech quality distortion (e.g., "roughness") than the
stair-step approximation used in the FAP method. However, at the
transmitter, our VFR scheme (hereafter loosely called the FAP
scheme) does not explicitly take advantage of the fact that the
receiver performs linear interpolation for data reconstruction.
The inclusion of this feature may perhaps yield further data
compression. To this end, we have adapted the so-called "fan
interpolation" technique, which has also been used in
telemetry applications [9,10].
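
The single- and double-threshold rules and the two receiver
reconstructions described in this section can be illustrated with a
short Python sketch. This is only an illustration under stated
assumptions, not the implementation of [8]: the function names, the
rule that the first frame is always sent, the holding of the last
value at the end of the data, and the example thresholds (1 and 3)
are all hypothetical.

```python
def fap_select(levels, t1, t2=None):
    """Return indices of the frames chosen for transmission.

    levels: quantized parameter levels (pitch or gain), one per frame.
    t1:     threshold of the single-threshold rule.
    t2:     optional second (higher) threshold of the double-threshold rule.
    """
    sent = [0]              # assumption: the first frame is always transmitted
    i = 1
    while i < len(levels):
        last = sent[-1]
        diff = abs(levels[i] - levels[last])
        if diff <= t1:
            i += 1                      # close enough: skip this frame
        elif t2 is not None and diff > t2 and i - 1 > last:
            sent.append(i - 1)          # large jump: send the preceding frame,
                                        # then re-test frame i against it
        else:
            sent.append(i)
            i += 1
    return sent


def reconstruct(levels, sent, linear=True):
    """Receiver side: rebuild every frame from the transmitted ones."""
    out = []
    for a, b in zip(sent, sent[1:]):
        for j in range(a, b):
            if linear:   # linear interpolation between transmitted values
                out.append(levels[a] + (levels[b] - levels[a]) * (j - a) / (b - a))
            else:        # stair-step: hold the last transmitted value
                out.append(float(levels[a]))
    # assumption: frames after the last transmission hold its value
    out.extend(float(levels[sent[-1]]) for _ in range(sent[-1], len(levels)))
    return out


def max_error(levels, approx):
    """Largest absolute difference between the data and its approximation."""
    return max(abs(x - y) for x, y in zip(levels, approx))


levels = [10, 10, 11, 12, 13, 13, 20, 20]
single = fap_select(levels, 1)        # frames 0, 3, 6
double = fap_select(levels, 1, 3)     # frames 0, 3, 5, 6
```

On this toy sequence the double-threshold rule sends one extra frame
(the one just before the jump to level 20), and the maximum error of
the linearly interpolated reconstruction drops accordingly, which is
the trade-off noted above.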

4.3.2  Fan  Interpolation  Technique  (FIT) 

Single-Threshold Scheme: The FIT method previously used in the
literature [9,10] is indeed a single-threshold scheme. The
method relies on the approximation of the analysis or source data
by straight line segments and transmits only those parameter
values corresponding to the end frames of these segments. Given
some initial transmitted frame, it finds the longest line for
which the maximum error magnitude between the line and the data
over the length of the line is below a given threshold. We
treated the case where quantized parameter values (levels) are
used for deciding when to transmit. In computing the error
between the quantized parameter level for a frame and the
interpolation line, we compute the interpolated value for that
frame, round it off to the nearest (integer) level and then find
the difference between this and the actual quantized parameter
level for that frame. (Rounding is done such that if the
fractional part of the interpolated value is equal to or greater
than 0.5 then it is rounded up, otherwise it is rounded down.)
At the receiver, quantized levels for untransmitted frames are
generated by interpolating between the adjacent transmitted
levels and rounding off the interpolated value to the nearest
level as explained above.

A step-by-step description of the FIT single-threshold
scheme is given in Fig. 4.7 below, where I_n denotes the quantized
level of the parameter for frame n, the symbol [ ] refers to the
above rounding operation, and T is the preselected threshold.


(1)  Transmit value at frame n
     m <- 2

(2)  k <- 1

(3)  P <- [ ((m-k)/m) I_n + (k/m) I_(n+m) ]
     E <- | I_(n+k) - P |
     If E <= T, go to (4)
     n <- n+m-1
     Go to (1)

(4)  k <- k+1
     If k <= m-1, go to (3)

(5)  (No transmission)
     m <- m+1
     Go to (2)

Fig. 4.7 Description of our FIT single-threshold scheme


It is clear from step (3) that with frames n and (n+m) as end
frames, the scheme looks at the magnitude of the interpolation
error, in order, from frame (n+1) to (n+m-1) and decides to
transmit frame (n+m-1) value at the first instance the error
magnitude exceeds T.
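
The single-threshold FIT procedure and the matching receiver
reconstruction can be sketched in Python as follows. This is a
minimal sketch that assumes integer levels; the function names are
hypothetical, and sending the final frame when the data run out is
an assumption, since the step-by-step description does not cover the
end of the data. The rounding helper implements the "round half up"
rule stated above in integer arithmetic.

```python
def interp_level(i_start, i_end, k, m):
    """Interpolated level k frames into an m-frame line, rounded half up."""
    num = (m - k) * i_start + k * i_end
    return (2 * num + m) // (2 * m)     # integer form of floor(num/m + 0.5)


def fit_single(levels, t):
    """Indices of transmitted frames for threshold t (integer levels)."""
    sent = [0]                          # step (1): transmit frame n
    n, m = 0, 2
    while n + m < len(levels):
        fits = all(
            abs(levels[n + k] - interp_level(levels[n], levels[n + m], k, m)) <= t
            for k in range(1, m)        # steps (2)-(4): test frames n+1..n+m-1
        )
        if fits:
            m += 1                      # step (5): no transmission, extend line
        else:
            n += m - 1                  # transmit the frame before the end frame
            sent.append(n)
            m = 2
    if sent[-1] != len(levels) - 1:
        sent.append(len(levels) - 1)    # assumption: flush the final frame
    return sent


def fit_reconstruct(levels, sent):
    """Receiver: interpolate and round between adjacent transmitted levels."""
    out = []
    for a, b in zip(sent, sent[1:]):
        out.extend(interp_level(levels[a], levels[b], k, b - a) for k in range(b - a))
    out.append(levels[sent[-1]])
    return out


levels = [5, 5, 5, 6, 7, 8, 8, 8, 3]
sent0 = fit_single(levels, 0)           # [0, 2, 5, 7, 8]
```

With a zero threshold, `fit_reconstruct(levels, sent0)` returns the
original levels exactly; with a threshold of 1 fewer frames are sent
and every reconstructed level stays within one level of the data.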


If T=0, it is easily seen that the receiver has the same
parameter data as at the output of the quantizer. The same
result is also achieved using the FAP method with a zero
threshold and with stair-step reconstruction at the receiver.
Average transmission frame rates produced by the two methods can,
however, be different; the extent of this difference depends upon
the nature of the data, in this case quantized parameter levels.
For instance, if the data has frequent occurrence of sequences of
equal levels (i.e., presence of horizontal or level lines), then
the FAP scheme would generally do better, yielding a lower frame
rate than the FIT method; the reason for this is that the latter
method transmits both end frames for each level line, while the
former transmits only the first end frame. On the other hand, if
the data involves a large number of sloped or nonlevel lines then
the opposite result is true in that the FIT method yields a lower
frame rate.


Experimental  results  obtained  using  the  above  FIT  method  on 
quantized  pitch  and  gain  are  reported  in  the  sequel. 

Double-Threshold Scheme: The double-threshold version of the FIT
method operates as follows. Assume that frames n and (n+m) are
the end frames of the interpolation line under consideration.
Then, (1) if the maximum interpolation error magnitude over the
length of the line exceeds the second (higher) threshold T2, then
frame (n+m-1) value is transmitted; (2) if the maximum error
magnitude exceeds the first (lower) threshold T1, and not T2,
then frame (n+m) value is transmitted; (3) if the maximum error
magnitude does not exceed T1, then a new interpolation line is
considered between frames n and (n+m+1), and the entire procedure
is repeated. A step-by-step description of the double-threshold
scheme is given in Fig. 4.8.

For our earlier FAP scheme, the motivation to use the
double-threshold scheme has been to improve the accuracy of
parameter interpolation performed at the receiver between
adjacent transmitted values. The same motivation does not hold
for the above FIT method, since it explicitly considers
interpolation error as part of its transmission strategy. Why,
then, should one consider the FIT double-threshold scheme? The
answer may be given as follows. Considering quantized parameter
data, the FIT single-threshold scheme allows only integer
thresholds. In effect, the double-threshold scheme may be viewed
as equivalent to a single-threshold scheme that can allow a
noninteger threshold. For example, the (0,1) double-threshold
scheme produces average frame rate and speech quality that lie
between those of the two single-threshold schemes with thresholds
0 and 1. This point will be made clearer by the experimental
results provided below.
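
The three cases of the double-threshold FIT rule described above can
be sketched in Python as follows. As before, this is an illustrative
sketch assuming integer levels, with hypothetical function names and
an assumed flush of the final frame at the end of the data.

```python
def interp_level(i_start, i_end, k, m):
    """Interpolated level k frames into an m-frame line, rounded half up."""
    num = (m - k) * i_start + k * i_end
    return (2 * num + m) // (2 * m)


def fit_double(levels, t1, t2):
    """Indices of transmitted frames for thresholds t1 <= t2."""
    sent = [0]
    n, m = 0, 2
    while n + m < len(levels):
        flag = False            # set when an error lies between T1 and T2
        exceeded_t2 = False
        for k in range(1, m):
            err = abs(levels[n + k] - interp_level(levels[n], levels[n + m], k, m))
            if err > t2:
                exceeded_t2 = True
                break
            if err > t1:
                flag = True
        if exceeded_t2:
            n += m - 1          # case (1): send the frame before the end frame
        elif flag:
            n += m              # case (2): send the end frame itself
        else:
            m += 1              # case (3): no transmission, try a longer line
            continue
        sent.append(n)
        m = 2
    if sent[-1] != len(levels) - 1:
        sent.append(len(levels) - 1)    # assumption: flush the final frame
    return sent


levels = [5, 5, 5, 6, 7, 8, 8, 8, 3]
sent01 = fit_double(levels, 0, 1)       # [0, 3, 6, 7, 8]
```

When both thresholds are equal the flag is never set, so the
procedure reduces to the single-threshold scheme with that
threshold.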

(1)  Transmit value at frame n
     m <- 2

(2)  Flag <- 0
     k <- 1

(3)  P <- [ ((m-k)/m) I_n + (k/m) I_(n+m) ]
     E <- | I_(n+k) - P |
     If E <= T2, go to (4)
     n <- n+m-1
     Go to (1)

(4)  If E <= T1, go to (5)
     Flag <- 1

(5)  k <- k+1
     If k <= m-1, go to (3)

(6)  If Flag = 0, go to (7)
     n <- n+m
     Go to (1)

(7)  (No transmission)
     m <- m+1
     Go to (2)

Fig. 4.8 Description of our FIT double-threshold scheme

Experimental Results

Below, we report experimental results obtained using the FIT
method on the quantized pitch and gain data. Our speech data
base consisted of a total of 11 utterances, representing about 25
seconds of speech, from 5 male and 5 female speakers. This data
base is the same as the one used for computing average
transmission frame rate data for our earlier FAP-type VFR schemes
in [8].

Pitch: The FIT single-threshold scheme produced average frame
rates of 35, 18 and 14 fps for values of the threshold T = 0, 1 and
2, respectively. Using the (0,1) double-threshold scheme, we
obtained an average frame rate of 26 fps. This latter rate
should be compared against the rate of 35 fps that we had
reported for our earlier (0,1) FAP scheme [8].


Gain: The FIT single-threshold scheme produced average frame
rates of 57, 31 and 22 fps for values of the threshold T = 0, 1 and
2, respectively. Using the FIT double-threshold scheme, we
obtained average frame rates of 41, 26 and 19 fps for the
thresholds (T1,T2) = (0,1), (1,2) and (2,3), respectively. In
contrast, the (2,3) double-threshold FAP scheme produced an
average frame rate of 32 fps [8].

With the objective of not tolerating any speech quality
loss, we have chosen to employ the single-threshold FIT scheme
with the threshold T = 0 for pitch transmission, and the (0,1)
double-threshold FIT scheme for gain transmission. The use of
the (0,1) double-threshold scheme for pitch and the (1,2)
double-threshold scheme for gain yielded only a small speech
quality loss, which consisted mainly of occasional "roughness"
(gain-related) and a couple of "clicks" (pitch-related) over the
data base of 36 sentences given in Section 7.2.1.

4.4  Discussion  and  Recommendations 

4.4.1  Transmission  of  Timing  Information 

With VFR transmission of a parameter, it is necessary to
transmit timing information to indicate to the receiver the
length of transmission interval between successive transmissions.
To this end, we proposed in NSC Note No. 82 [7] that a 3-bit
header be transmitted for every analysis frame. The first header
bit is 1 only if pitch is transmitted for that frame; similarly,
the second and third header bits are used to indicate if gain and
LARs, respectively, are transmitted for that frame. This
proposal of transmitting a 3-bit header allows the use of a
separate transmission criterion for each of the three parameter
groups: pitch, gain, and LARs, and hence accommodates one of the
postulates of our perceptual model of speech.
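
As an illustration of how such a header carries the timing
information, the following Python sketch packs the three flags into
one 3-bit value per analysis frame and recovers the transmission
intervals from a sequence of headers. The names are hypothetical,
and mapping the "first" header bit to the most significant bit is an
assumption; the proposal in [7] may order the bits differently.

```python
PITCH, GAIN, LARS = 0b100, 0b010, 0b001   # assumed order: first header bit = MSB


def pack_header(pitch_sent, gain_sent, lars_sent):
    """Build the 3-bit header for one analysis frame."""
    return ((PITCH if pitch_sent else 0)
            | (GAIN if gain_sent else 0)
            | (LARS if lars_sent else 0))


def unpack_header(header):
    """Return (pitch_sent, gain_sent, lars_sent) for one header."""
    return bool(header & PITCH), bool(header & GAIN), bool(header & LARS)


def interval_lengths(headers, flag):
    """Frames elapsed between successive transmissions of one parameter group."""
    idx = [i for i, h in enumerate(headers) if h & flag]
    return [b - a for a, b in zip(idx, idx[1:])]


headers = [
    pack_header(True, True, True),     # frame 0: everything sent
    pack_header(False, True, False),   # frame 1: gain only
    pack_header(False, False, False),  # frame 2: nothing sent
    pack_header(True, False, True),    # frame 3: pitch and LARs
]
```

Because each parameter group has its own bit, the receiver can
compute a separate interpolation interval for pitch, gain and LARs
from the same header stream.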

4.4.2  Recommendations 

Experimentally we found that the following VFR system
yielded the maximum data compression without compromising speech
quality relative to the full 100 fps fixed-rate system. In this
experiment, an 11-th order LPC analysis was performed; the 11
LARs were quantized using 46 bits, which were allocated from the
first to the 11-th coefficient as 6, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3 bits;
pitch and gain were quantized using 6 and 5 bits respectively;
the simplified VFR scheme given in Section 4.2.8 was used for LAR
transmission.

Recommended  VFR  System 

LARs: Threshold, e = 1.0

Pitch: Single-threshold FIT scheme with threshold T = 0 (see
Section 4.3.2)

Gain: Double-threshold FIT scheme with thresholds T1 = 0 and
T2 = 1 (see Section 4.3.2)
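
As a rough consistency check on the bit rates quoted in this
section, the bit rate can be estimated from the bit allocation and
the average frame rate of each parameter group. The sketch below is
only an estimate under stated assumptions: the 3-bit header of
Section 4.4.1 is assumed to be sent for every 100 fps analysis
frame, the fixed-rate system is assumed to send no header, and the
VFR frame rates used (31, 34 and 40 fps for LARs, pitch and gain)
are the averages reported for this system in this section. The
function and variable names are hypothetical.

```python
LAR_BITS = [6, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3]          # per-coefficient allocation
BITS = {"lars": sum(LAR_BITS), "pitch": 6, "gain": 5}  # 46 + 6 + 5 = 57 bits


def average_bit_rate(rates_fps, bits, header_bits=3, analysis_rate_fps=100):
    """Estimated bits per second: parameter payload plus per-frame header."""
    payload = sum(rates_fps[name] * bits[name] for name in bits)
    return payload + header_bits * analysis_rate_fps


fixed = average_bit_rate({"lars": 100, "pitch": 100, "gain": 100}, BITS,
                         header_bits=0)                 # 5700
vfr = average_bit_rate({"lars": 31, "pitch": 34, "gain": 40}, BITS)  # 2130
```

The fixed-rate estimate reproduces the 5700 bps figure, and the VFR
estimate of about 2130 bps is close to the reported average of 2120
bps; the small difference is plausibly due to rounding of the
reported average frame rates.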

The above vocoder is referred to as PMH in Section 7.3.3, and it
yielded the following transmission statistics for the 36-sentence
(6 sentences x 6 speakers) data base given in Section 7.2.1. The
LAR transmission frame rate computed over individual sentences
varied from a maximum of 44 fps to a minimum of 14 fps, with an
average of 31 fps. The average, maximum and minimum transmission
frame rates for pitch were 34, 43 and 25 fps respectively, and
those for gain were 40, 54 and 24 fps respectively. The bit
rate varied from a maximum of 2817 bps to a minimum of 1274 bps,
with an average of 2120 bps. This average bit rate of 2120 bps
for the above VFR system should be contrasted with the bit rate
of 5700 bps for the full 100 fps fixed-rate system. With the

5. SYNTHESIS

In this section, we report the results of our work on the
following three items: optimal linear interpolation of
synthesizer parameters, gain implementation, and all-pass
excitation.

5.1  Optimal  Linear  Interpolation 

In narrowband LPC speech compression systems, the process of
parameter interpolation at the receiver helps in smoothing the
roughness in the synthesized speech which is normally associated
with infrequent parameter updating. Simple linear interpolation
(SLI) has been used almost exclusively in these systems. In an
earlier study we found that the spectral error due to
interpolation was much larger than the error due to quantization
[1]. This result suggests that better parameter interpolation
approaches than the simple linear scheme should be investigated.
With this motivation, we developed an optimal linear
interpolation (OLI) scheme that requires the transmission of an
extra parameter, α, per data frame. The value of α is
determined as that point along the line used for linear
interpolation which is closest (in the mean square sense) to the
point determined by the actual parameter values at the instant
where interpolation is desired. The transmission of α requires
50-150 bits/sec, depending on the frame rate and the number of
bits used for quantizing α.

Theoretical and experimental results that we obtained with
the new interpolation scheme have been presented in detail in a
BBN report [11], which was also issued as NSC Note No. 59.
Briefly, theoretical results showed that in the space of
parameter vectors, the OLI scheme corresponds to an orthogonal
projection of the actual parameter vector at the interpolation
point onto the line passing through the two parameter vectors
that are used in the interpolation. Several ways of using the
OLI scheme with a variable frame rate transmission system are
also given. Experimental results showed that the OLI scheme
improved speech quality relative to the SLI scheme, especially
during rapid transitions in the speech signal. In addition to
informal listening tests, we investigated the waveforms and
spectrograms of synthesized speech with OLI, and the time history
of the spectral error. In our experience, the optimal scheme is
most advantageous when used with low bit rate, variable frame
rate transmission systems.
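
The orthogonal-projection interpretation can be illustrated with a
short Python sketch. It assumes an unweighted Euclidean (mean
square) distance between parameter vectors and uses hypothetical
names; the extra transmitted parameter is called `alpha` here. The
exact formulation, including any weighting and the quantization of
the parameter, is given in [11] and is not reproduced.

```python
def oli_alpha(p1, p2, actual):
    """Position along the line p1 -> p2 closest to the actual vector.

    Orthogonal projection: alpha = (actual - p1).(p2 - p1) / |p2 - p1|^2.
    """
    d = [b - a for a, b in zip(p1, p2)]
    denom = sum(x * x for x in d)
    if denom == 0.0:
        return 0.0                      # identical end points: any alpha works
    return sum((c - a) * x for a, c, x in zip(p1, actual, d)) / denom


def interpolate(p1, p2, alpha):
    """Point on the interpolation line at position alpha."""
    return [a + alpha * (b - a) for a, b in zip(p1, p2)]


def sq_error(u, v):
    """Squared Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


p1, p2 = [0.0, 0.0], [2.0, 0.0]         # transmitted parameter vectors
actual = [0.5, 1.0]                     # true vector at the interpolation instant
alpha = oli_alpha(p1, p2, actual)       # 0.25
```

In this example simple linear interpolation at the midpoint
(alpha = 0.5) gives a squared error of 1.25, while the projected
value alpha = 0.25 gives 1.0, the smallest value attainable on the
line.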

5.2 Gain Implementation

We investigated three issues involving the linear predictor gain
parameter. The first issue was the choice of the gain parameter
for transmission; we discussed this issue in Section 3.3. The
second issue considered the problems associated with implementing
the speech signal energy as a multiplier at the output of the
synthesizer filter instead of the more commonly used method of
applying it at the filter input. The third issue was the
treatment of cases for which speech signal energy had values less
than 1 (or negative when considered in decibels). Below, we
describe our work on the second and the third issues.

The use of the normalized filter [4] (see Section 3.3) is
recommended for implementation of the synthesizer on the SPS-41
for many reasons, such as better round-off noise and scaling
properties, the availability of sine and cosine tables in the
SPS-41, etc. Placing the gain multiplier at the output of the
normalized filter rather than at its input serves to alleviate
dynamic range problems. However, care has to be exercised in
implementing the speech signal energy at the output of either the
normalized filter or the regular filter. The difficulty arises
from the nonzero initial conditions of the filter. Whenever
there is a relatively large change in speech signal energy from
one frame to the next, say, of the order of 10 dB, the
synthesized speech is found to have signal amplitudes quite
different from those of the original input speech. For example,
in an unvoiced-voiced transition, the first voiced frame in the
synthesized speech has relatively large signal amplitudes
compared to the original speech. We showed both experimentally
and mathematically that this problem was due to the nonzero
initial conditions of the filter. When listening to speech
synthesized with speech signal energy implemented at the output
of the synthesizer filter, we perceived these distortions in
signal amplitudes as annoying "knock sounds". A solution to the
problem, which we found to be satisfactory, is to zero the
initial conditions whenever the absolute frame-to-frame energy
change exceeds a given threshold (about 12 dB). With this
method, the distortions in signal amplitudes which caused the
perception of "knock sounds" were eliminated.
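
The reset rule can be illustrated with the following Python sketch.
It is a simplified stand-in, not the SPS-41 implementation: it uses
a direct-form all-pole filter rather than the normalized filter, the
class and parameter names are hypothetical, and the output gain is
passed in directly because its derivation from the transmitted
energy is not reproduced here.

```python
class OutputGainSynthesizer:
    """All-pole synthesis filter with the gain applied at the filter output."""

    def __init__(self, order, reset_db=12.0):
        self.memory = [0.0] * order     # past filter outputs (before the gain)
        self.reset_db = reset_db        # energy-jump threshold in dB
        self.prev_energy_db = None

    def frame(self, excitation, predictor, energy_db, gain):
        """Synthesize one frame; y[n] = u[n] + sum_k a_k * y[n-k]."""
        jump = (self.prev_energy_db is not None
                and abs(energy_db - self.prev_energy_db) > self.reset_db)
        if jump:
            # zero the initial conditions on a large frame-to-frame change
            self.memory = [0.0] * len(self.memory)
        self.prev_energy_db = energy_db
        out = []
        for u in excitation:
            y = u + sum(a * s for a, s in zip(predictor, self.memory))
            self.memory = [y] + self.memory[:-1]
            out.append(gain * y)
        return out


synth = OutputGainSynthesizer(order=1)
first = synth.frame([1.0, 0.0, 0.0], [0.9], energy_db=40.0, gain=1.0)
second = synth.frame([0.0, 0.0], [0.9], energy_db=20.0, gain=1.0)  # 20 dB drop
```

After the 20 dB drop the filter memory is cleared, so the residue of
the first frame is not carried into the second frame and rescaled by
the new gain.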

In  logarithmically  quantizing  speech  signal  energy  we  used  a 
range  of  0 to  45  dB.  Any  signal  energy  less  than  0 dB  was 
quantized  as  0 dB.  From  synthesis  experiments  we  found  that  this 
strategy  of  raising  the  energy  from  a negative  dB  value  to  0 dB 
produced  relatively  large  perceivable  noise  during  stop  sounds, 
pauses  and  silences.  This  led  us  to  quantize  energy  values  less 
than  or  equal  to  0 dB  as  a given  negative  dB.  We  found  through 
listening  tests  that  when  we  used  a large  negative  dB  value,  the 
beginnings of certain speech sounds (e.g., [h], [n], [d]) were
somewhat  cut  off.  By  experimentation,  we  found  a value  of  -3  or 
-4  dB  to  be  satisfactory. 
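
The resulting quantization rule can be sketched as below. The 0 to
45 dB range and the floor value of about -3 dB come from the text;
the uniform step size of 1.5 dB and the function name are
hypothetical, since the actual quantization table is not given here.

```python
import math


def quantize_energy_db(energy_db, floor_db=-3.0, max_db=45.0, step_db=1.5):
    """Quantize frame energy in dB, mapping values <= 0 dB to a fixed floor."""
    if energy_db <= 0.0:
        return floor_db                 # silence, pauses and stop gaps
    level = math.floor(energy_db / step_db + 0.5) * step_db   # nearest step
    return min(level, max_db)           # clip at the top of the range
```

For example, `quantize_energy_db(-20.0)` and
`quantize_energy_db(0.0)` both return -3.0, while values above 45 dB
are clipped to 45.0.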

5.3  All-Pass  Excitation 


With the use of the pulse/noise excitation for the
minimum-phase LPC synthesizer, the synthesized speech was found
to have larger peak amplitudes than the natural speech used in
the analysis. To accommodate this situation, we have used 9 bits
to store input or natural speech samples, and 12 bits to store
synthesized speech samples. Since the full dynamic range
possible with 12 bits was not effectively used in storing the
synthesized speech samples, the signal-to-noise ratio (noise at
the D/A converter) was lower, sometimes producing less desirable
audio quality at the output of the D/A converter. To overcome
this problem, we employed an all-pass excitation as described
below.

We chose an 8th order all-pass filter given in [12], which
was specifically designed to minimize the peak amplitude of its
impulse response. An all-pass excitation signal can be obtained by
filtering the pulse/noise excitation signal through this all-pass
filter. To simplify computations, however, we precomputed once
at the start 40 samples (4 ms at 10 kHz sampling rate) of the
impulse response of the all-pass filter and stored them in
memory. If a given frame was unvoiced, we used the random noise
sequence directly as the excitation signal (i.e., no all-pass
filtering was done); this strategy worked fine since high peak
amplitudes occurred only in voiced speech. For a voiced frame,
we chose one of the following two cases, depending on the value
of the pitch period for that frame: 1) If the pitch period was
longer than 4 ms, we took the 40 samples of the all-pass impulse
response and appended the required number of zeros to generate
the excitation signal. 2) If the pitch period was shorter than
4 ms, we used the "aliased" version of the 40-sample sequence,
which was obtained by considering the periodic occurrence of
this sequence at a rate given by the pitch frequency.
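
Both voiced cases can be covered by one wrap-around summation, as in
the Python sketch below. The function name is hypothetical, and the
5-sample sequence is only a stand-in for the 40 stored samples of the
all-pass impulse response from [12], which are not reproduced here.
Treating the "aliased" version as the sum of copies of the sequence
shifted by whole pitch periods is an interpretation of the
description above.

```python
def voiced_excitation(impulse_response, period):
    """One pitch period of all-pass excitation (period given in samples)."""
    out = [0.0] * period
    for i, value in enumerate(impulse_response):
        out[i % period] += value    # wraps around only when period < len(h)
    return out


h = [1.0, 2.0, 3.0, 4.0, 5.0]            # stand-in for the 40 stored samples
long_period = voiced_excitation(h, 8)    # [1, 2, 3, 4, 5, 0, 0, 0]
short_period = voiced_excitation(h, 3)   # [1+4, 2+5, 3] = [5, 7, 3]
```

When the period is at least as long as the stored sequence, no index
wraps and the result is the sequence followed by zeros (case 1);
otherwise the tail of the sequence folds back onto its start
(case 2).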

By conducting synthesis experiments, we found that peak
amplitudes were in fact lowered when using the specific all-pass
excitation discussed above. Even in this case, however, peak
amplitudes of synthesized speech were higher than those of the
natural speech; the increase in peak amplitudes due to synthesis
was often found to be about 6 dB or less. We accommodated this
increase by using 11-bit natural speech samples, and 12-bit
synthesized speech samples. Using this approach, the audio
quality of speech at the output of the D/A converter was found to
be better than what we had previously found. We used this
approach in generating stimuli for subsequent subjective quality
tests.

6. A MIXED-SOURCE MODEL

We developed a new model for generating the excitation
signal for the synthesizer of the narrowband LPC vocoders, with
the objective of enhancing the naturalness of the synthesized
speech. Most present-day narrowband vocoders employ an idealized
source (or excitation) model, which is either a sequence of
quasi-periodic pulses for voiced sounds, or white noise for
unvoiced sounds. This voiced/unvoiced model seems to be largely
responsible for the "buzziness" and lack of naturalness perceived
in the resulting synthesized speech. Our new source model,
called the mixed-source model, combines both pulse and noise
sources in a novel way. Based on the observation that spectra of
voiced speech sounds (e.g., voiced fricatives and even certain
vowels) exhibit devoiced or incoherent high frequency bands, the
model divides the spectrum into a low frequency region and a high
frequency region, with the pulse source exciting the low region
and the noise source exciting the high region. The cutoff
frequency F_c that separates the two regions is adaptively varied
in accordance with the changing speech signal.

The mixed-source model is described in detail in a paper
which is included in this report as Appendix 7. As depicted in
Fig. 4 of that paper, the outputs of the low-pass and high-pass
filters are added, multiplied by the source gain and applied to
the synthesizer as the excitation signal. For unvoiced sounds
(F_c = 0), the model employs a pure noise excitation. Since small
changes in F_c are not perceptible, it is sufficient to quantize
F_c into 2-3 bits for transmission purposes.

The cutoff frequency F_c is a continuous parameter, and so
errors in the extraction of F_c degrade the quality of the
synthesized speech much more gracefully than the errors in the
binary voicing parameter of the voiced/unvoiced model. Thus, the
mixed-source model promises to be a more robust source model.

We  developed  a method  for  automatically  extracting  F^  from 
the  speech  signal.  The  method  is  a peak-picking  algorithm  on  the 
signal  spectrum.  It  determines  periodic  regions  of  the  spectrum 
by  examining  the  separation  between  consecutive  peaks  and 
determining  whether  the  separations  are  the  same,  within  some 
tolerance level. Fc is taken to be the highest frequency at
which  the  spectrum  is  considered  periodic. 
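
The peak-spacing test described above can be illustrated with a minimal Python sketch. It is not the report's implementation: the function name, the tolerance value, and the choice of the first peak spacing as the reference are all assumptions made for illustration, and the spectral peaks are assumed to have been picked already.

```python
def estimate_cutoff(peak_freqs_hz, tolerance_hz=20.0):
    """Return the highest frequency up to which spectral peaks are evenly spaced.

    peak_freqs_hz: ascending frequencies of already-picked spectral peaks.
    tolerance_hz:  allowed deviation of a spacing from the reference spacing
                   (illustrative value, not taken from the report).
    """
    if len(peak_freqs_hz) < 2:
        return 0.0  # no periodic region found; corresponds to pure noise
    # Assumption: the first spacing is used as the reference spacing.
    reference = peak_freqs_hz[1] - peak_freqs_hz[0]
    cutoff = 0.0
    for lower, upper in zip(peak_freqs_hz, peak_freqs_hz[1:]):
        if abs((upper - lower) - reference) > tolerance_hz:
            break  # spacing is no longer regular: end of the periodic region
        cutoff = upper
    return cutoff


peaks = [100.0, 200.0, 300.0, 400.0, 520.0, 700.0]
print(estimate_cutoff(peaks, tolerance_hz=10.0))
```

With these hypothetical peaks the spacing stays at 100 Hz up to 400 Hz and then widens, so the periodic region is taken to end at 400 Hz.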

In  our  implementation  of  the  mixed-source  model,  we  rounded 
the automatically extracted value of Fc to the nearest 500 Hz.
Therefore, we needed low-pass and high-pass filters with cutoff
frequencies separated by 500 Hz. For each value of Fc, the 3 dB
points for the low-pass and high-pass filters were designed to be
equal to Fc, in order that the spectrum of the final excitation
may be as flat as possible. The roll-off of the filters was
considered to be of secondary importance, but should not be very
sharp in any case. We considered FIR (finite impulse response)
as  well  as  recursive  (low  order  Butterworth)  filter  designs.  The 
filter  designs  were  stored  and  used  in  the  synthesis  as  the  need 
arose.  Both  FIR  and  recursive  filter  designs  gave  similar 
perceptual  results. 

Results  of  synthesis  experiments  conducted  to  test  the 
effectiveness  of  the  mixed-source  model  are  given  in  Appendix  7. 
Briefly, the model was found to largely eliminate the "buzzy"
quality of vocoded speech, perform better for female speech, and
result in a certain "fullness" in perceived speech quality that
was absent with the voiced/unvoiced synthesis.

7. SUBJECTIVE SPEECH QUALITY EVALUATION


7.1  Introduction 


We  describe  our  work  on  subjective  quality  evaluation  in 
three  parts.  Section  7.2  describes  the  development  and  testing 
of  an  improved  method  for  measuring  subjective  quality,  using  our 
Phoneme-Specific  sentence  materials.  Section  7.3  describes  three 
applications of this method to practical problems:

(1) Determining parametrically how subjective quality depends on
vocoder parameters, specifically a) the order of the linear
predictor (number of poles), b) the step size for quantization of
the LPC coefficients (log area ratios or LARs), and c) frame
transmission rate. In addition to their usefulness in vocoder
design decisions, these data were also needed for the development
of our objective method for assessing speech quality (see Section
8).

(2) Proving that a given reduction of bit rate is achieved at a
much smaller cost in reduced quality if the bit rate is reduced by
substituting a variable for a fixed transmission schedule, rather
than reducing the predictor order or coarsening quantization of
the coefficients.

(3) Demonstrating formally the superior quality and low bit rate
of our perceptual-model-based VFR scheme described in Section 4.2.


Finally,  Section  7.4  describes  miscellaneous  topics  such  as: 
(1) a phoneme-specific intelligibility test, using nonsense
materials, which we later decided was not appropriate except for
testing LPC systems which had been implemented in real time,
which ours had not; (2) the effect of lost packets on the
intelligibility of speech transmitted over ARPANET;
(3) development of an inventory of descriptors for different
perceptual attributes of LPC vocoder speech quality; and (4) an
attempt to reduce the effects of stimulus sequence on listeners'
judgments.


7.2  Development  of  Method 


7.2.1  Phoneme  Specific  Sentences 


The development of our Phoneme-Specific test sentences grew
from  the  observation  that  different  vocoders  may  cause  different 
types  of  quality  degradations  within  a single  test  sentence.  For 
example,  one  system  may  degrade  the  nasal  consonants,  and  another 
the  fricatives.  Such  differences  are  a major  cause  of  the 
variability  commonly  found  in  subjective  quality  testing.  If 
such  information  could  be  made  explicit,  it  would  have  important 
diagnostic  implications  for  how  a vocoder  should  be  modified  to 
improve  its  quality. 


Judgments  of  global  quality  are  not  easy  when  the  stimuli 
being  compared  differ  in  a variety  of  ways.  Nor  is  it  easy  to 
compare speech samples with respect to some particular property
when they differ with respect to many other properties as well.
Further,  the  psychometric  literature  is  unequivocal  in  showing 
that  subjects  find  quantity  much  easier  to  judge  than  quality. 
One  way  to  simplify  the  subject's  task  is  to  arrange  that  the 
stimuli  presented  for  judgment  differ  with  respect  to  only  one 
perceptual dimension at a time. Note that this is not the same
as asking the subject to attend to only one perceptual dimension
at a time when the stimuli differ in other ways as well.

We  attempted  to  achieve  this  perceptual  effect,  or  something 
close  to  it,  by  analyzing  the  sources  of  distortion  introduced 
into speech by the LPC vocoding process, and targeting each of
these  sources  with  one  or  more  sentences  designed  to  maximize  the 
errors  due  to  it,  while  simultaneously  minimizing  the  errors  due 
to  the  other  sources.  Although  our  tests  were  aimed  specifically 
at  LPC  vocoders,  they  are  probably  applicable  to  other  methods  of 
vocoding  as  well.  The  resulting  sentences  are  Phoneme-Specific, 
in  that  they  concentrate  all  phonemes  with  similar  acoustic 
properties  in  a single  sentence.  This  contrasts  with  earlier 
materials,  which  treated  any  sentence  as  equivalent  to  any  other. 
The  equivalent-sentence  paradigm  involves  a logical  inconsistency 
because  it  implicitly  assumes  that  speech  is  homogeneous,  and  at 
the  same  time  denies  this  assumption  in  its  attempt  to  achieve 
phonetic  balance  by  forcing  the  relative  frequency  of  occurrence 
of  phonemes  within  the  test  materials  to  match  those  of  the 
language at large.

There are three primary sources of distortion inherent in
linear  predictive  vocoders.  The  first  derives  from  the  predictor 
model  itself.  The  linear  predictor  coefficients  effectively 
define  the  spectrum  of  an  all-pole  filter,  which  is  adjusted 
until  it  best  matches  the  envelope  of  the  power  spectrum  of  a 
short  sample  of  input  speech.  Some  speech  sounds,  however,  are 
not  adequately  modelled  by  an  all-pole  spectrum,  since  their 
spectra  contain  zeroes  as  well  as  poles  (although  adequate 
matches  can  be  obtained  if  the  number  of  poles  is  sufficiently 
large). Errors deriving from this source degrade phonemes whose
spectra  contain  zeroes,  such  as  nasals  and  fricatives.  The 
second  source  of  distortion  is  in  the  quantization  of  the  LPC 
coefficients  for  transmission.  The  quantization  introduces  some 
inaccuracy  to  the  degree  that  the  spectrum  specified  by  the 
quantized  coefficients  differs  from  that  specified  by  the  same 
coefficients  before  quantization.  Distortions  due  to  this  source 
should  be  most  apparent  when  the  speech  spectrum  is  changing 
relatively  slowly,  as  in  vowels  and  semi-vowels.  Third,  the  time 
interval  defining  the  waveform  sample  is  moved  down  the  waveform 
by  a time  equal  to  the  reciprocal  of  the  frame  rate,  and  the 
spectral  modelling  is  repeated.  The  slower  the  frame  rate,  the 
wider  the  intervals  at  which  the  speech  spectrum  is  sampled,  and 
the  greater  the  chance  that  rapidly  changing  parts  of  the 
waveform  will  be  inadequately  represented.  This  type  of  error 
should be most noticeable when a system with too slow a frame
rate has to process speech containing stops and affricates, which
are characterized by rapid changes in both spectrum and
amplitude.
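
To make the first source of distortion concrete, the following Python sketch evaluates the power spectrum of an all-pole filter from a set of predictor coefficients. It is only an illustration: the function name is hypothetical, and the sign convention A(z) = 1 + sum of a_k z^-k is an assumption, since the convention is not stated in this section. Because the numerator is a constant gain, such a spectrum can form peaks (poles) but cannot form the spectral zeroes found in nasals and fricatives.

```python
import cmath
import math


def all_pole_power_spectrum(coeffs, gain, freqs_hz, sample_rate_hz):
    """Evaluate |G / A(e^{jw})|^2, assuming A(z) = 1 + sum_k a_k z^-k."""
    spectrum = []
    for f in freqs_hz:
        omega = 2.0 * math.pi * f / sample_rate_hz
        # Denominator polynomial evaluated on the unit circle.
        a = 1.0 + sum(a_k * cmath.exp(-1j * omega * (k + 1))
                      for k, a_k in enumerate(coeffs))
        spectrum.append(gain ** 2 / abs(a) ** 2)
    return spectrum


# Illustrative one-pole example at a 10 kHz sampling rate.
low, high = all_pole_power_spectrum([-0.9], 1.0, [0.0, 5000.0], 10000.0)
print(low, high)
```

In this one-pole example the spectrum is large near 0 Hz, where the pole lies, and small at 5 kHz.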

A set of four phoneme-specific sentences was selected from a
much larger set, which appears complete in Appendix 8. The four
phoneme-specific sentences were intended to target the sources of
error described above. Two additional "general" sentences were
included, which contained many consonant clusters and unstressed
syllables. The six sentences are as follows:




1.  Why  were  you  away  a year,  Roy? 

2.  Nanny  may  know  my  meaning. 

3. His vicious father has seizures.

4.  Which  tea-party  did  Baker  go  to? 

5.  The  little  blankets  lay  around  on  the  floor. 

6.  The  trouble  with  swimming  is  that  you  can  drown. 




The first four sentences include among them all the
consonants of English, except /l/, /θ/, and /j/. The first
sentence contains only vowels and glides. These sounds have
all-pole spectra, which change slowly, and contain no abrupt
changes in level. This results from the fact that these sounds
are produced with a relatively open vocal tract, excited at the
bottom, and without any shunting cavities to cause zeroes (as in
/l/) or extra formants. Furthermore, all the sounds are voiced,
and only slow changes of pitch occur.

The second sentence contains only (nasalized) vowels and
nasals. It is therefore also voiced throughout, and its spectrum
and  level  change  relatively  slowly.  Both  the  nasals  and  the 
nasalized  vowels  contain  zeroes  in  their  spectra,  however,  which 
should  create  problems  for  LPC  vocoders  in  the  spectral  matching 
stage. 


Besides  vowels,  the  third  sentence  contains  only  voiced  and 
unvoiced  fricatives.  Fricatives  contain  zeroes  in  their  spectra 
(actually pole-zero pairs which approximately cancel each other),
but  have  spectra  very  different  from  those  of  voiced  sounds,  due 
to  the  noise  excitation.  Rates  of  amplitude  change  are  still 
slow,  since  affricates  were  excluded  from  the  sentence. 

The  fourth  sentence  contains  only  vowels  and  all  the  stops 
and  affricates  except  /j/.  The  spectrum  and  amplitude  of  the 
speech  wave  change  frequently  and  abruptly,  and  there  are  many 
voiced/unvoiced transitions. This sentence should maximally
strain  a vocoder's  ability  to  follow  rapid  changes. 

The last two sentences were included as "general,
non-diagnostic" sentences, partly to include problematical
clusters  which  would  have  sullied  the  purity  of  the 
phoneme-specific  sentences,  but  also  to  increase  the  number  of 
rapid  unstressed  (and  reduced)  syllables,  which  tend  to  be  less 
clearly articulated.

A second set of dimensions along which samples of speech can
vary concerns the idiosyncratic differences among speakers.
Following arguments similar to those above for sentence
materials, it is important to represent as wide a range of
speakers' physical (as opposed to dialectal) characteristics as
possible, rather than to choose "typical" speakers. Therefore, we
recorded twenty talkers, ten male and ten female, reading each of the
sentences,  and  selected  from  these  three  males  and  three  females 
so  as  to  retain  the  full  range  of  fundamental  frequency  and 
nasality  found  in  the  whole  group.  (Nasality  was  measured  by  an 
accelerometer  mounted  on  the  talker's  nose,  whose  output  was 
compared  in  the  second,  nasal  sentence,  and  the  fourth,  non-nasal 
sentence,  with  overall  levels  equated.)  Talkers  who  spoke 
slowly,  or  had  regional  accents,  were  eliminated. 

7.2.2  Psychophysical  Method 

A variety  of  different  psychophysical  tasks  can  be  used  for 
assessing  subjective  speech  quality.  These  represent  different 
compromises  between  the  complexity  and  duration  of  the  subject's 
task.  The  subjective  task  that  imposes  fewest  constraints  on  the 
listener  is  the  paired  comparison  task.  Pairs  of  stimuli  are 
presented,  and  the  listener  simply  indicates  which  member  of  each 
pair  he  prefers.  Alternatively,  he  may  assign  numbers  to  show 
how  similar  the  two  stimuli  appear,  yielding  similarities  or 
proximity data. Since only two stimuli are presented at a time,
the listener never has to explicitly resolve the problem that the
members  of  successive  pairs  may  differ  in  different  ways. 
Unfortunately,  the  number  of  paired  comparisons  that  has  to  be 
made  increases  with  the  square  of  the  number  of  stimuli,  so  that 
the  exhaustive  procedure  becomes  unmanageable  when  there  are  more 
than 15 to 20 stimuli to be compared, although various sampling
schemes  are  available.  Thus,  paired  comparisons  are  easy  but 
tedious. 
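
The quadratic growth mentioned above follows from the number of distinct pairs among n stimuli, n(n-1)/2. A short Python illustration (the function name is chosen here for illustration only):

```python
def paired_comparisons(n_stimuli):
    """Number of distinct unordered pairs among n_stimuli stimuli."""
    return n_stimuli * (n_stimuli - 1) // 2


for n in (5, 15, 20):
    print(n, paired_comparisons(n))
# 15 stimuli require 105 comparisons and 20 stimuli require 190,
# before any repetition for reliability is added.
```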

An  alternative  is  a ranking  task,  in  which  subjects  are 
given  several  stimulus  sentences,  and  have  to  rank  order  them 
according  to  quality.  With  conventional  materials  this  is  a very 
difficult  task  that  generates  much  variability,  since  the  subject 
must  decide  how  to  trade  off  one  sort  of  degradation  against 
another,  and  apply  that  trade-off  consistently.  When  the  speech 
varies  along  several  perceptual  dimensions,  maintaining 
consistency  with  respect  to  each  of  the  required  trade-offs 
becomes  impossible.  The  foregoing  difficulties  can  be  reduced  by 
using the Phoneme-Specific sentence materials described above,
and  presenting  stimuli  for  ranking  that  consist  of  only  a single 
sentence  processed  by  all  the  vocoder  systems  to  be  compared.  As 
compared  with  the  paired-comparison  task,  this  reduces  the  amount 
of data to be collected, at the expense of making the listener's
task slightly more difficult, and introducing the risk of reduced
reliability. At the same time, the ranking task retains some of
the desirable features of the paired comparison task. Since a
stimulus  can  be  listened  to  repeatedly,  the  subject  can  build  up 
his  rank  order  by  starting  with  a pair,  then  placing  the  third 
stimulus  in  the  correct  ranking  with  respect  to  the  preceding 
two,  and  so  on.  Thus,  new  stimuli  may  be  added  by  a series  of 
paired  comparisons  with  the  members  of  the  existing  rank  order. 
The  rank-order  procedure  reduces  the  number  of  times  each 
stimulus  must  be  presented  to  perhaps  five  or  ten  per  stimulus. 
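
The way a rank order can be built up from successive paired comparisons may be sketched as follows. This is an illustrative model of the listener's procedure, not something taken from the report: the binary-search placement strategy and the names `rank_by_insertion` and `prefers` are assumptions.

```python
def rank_by_insertion(stimuli, prefers):
    """Build a best-first rank order using only paired comparisons.

    prefers(a, b) returns True if stimulus a is judged better than b.
    Returns the ranking and the number of paired comparisons used.
    """
    ranking = []
    comparisons = 0
    for s in stimuli:
        lo, hi = 0, len(ranking)
        # Locate the position of s among the stimuli already ranked.
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if prefers(s, ranking[mid]):
                hi = mid
            else:
                lo = mid + 1
        ranking.insert(lo, s)
    return ranking, comparisons


# Hypothetical quality scores, where a lower number means better quality.
order, used = rank_by_insertion([3, 1, 2, 5, 4], lambda a, b: a < b)
print(order, used)
```

Under this assumed strategy each new stimulus is compared with only a few members of the existing order, which is why ranking needs far fewer presentations than exhaustive paired comparison.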

A complication  of  the  ranking  task  results  from  the  fact 
that  the  range  of  qualities  encountered  within  a single  test 
sentence  as  processed  by  several  vocoders  may  be  very  different 
from  the  range  for  a second  test  sentence.  Thus  there  may  be  a 
considerable  range  of  qualities  associated  with  the  lowest  rank, 
but  there  is  no  way  for  the  subject  to  express  this,  although  he 
might  be  willing  and  able  to  do  so  if  given  the  chance.  The 
ranking  task  becomes  more  difficult,  and  the  data  from  it  become 
less  reliable,  as  the  number  of  systems  to  be  ranked  increases. 
The  method  is  probably  not  appropriate  when  more  than  20  systems 
are  to  be  compared. 

A rating  task  avoids  some  of  these  problems,  and  is  perhaps 
the  most  efficient  task  possible,  in  that  it  requires  only  a 
single  presentation  of  each  stimulus.  In  practice,  several 
presentations  are  often  used,  to  improve  reliability.  At  the 
same time this task requires most from the subject, who must
assign numbers that reflect his perception of quality, and stick
to  the  same  rating  system  through  the  whole  experiment.  His 
criterion  may  drift  during  an  experiment  lasting  an  hour  or  more, 
and  it  is  difficult  to  assess  how  much  drift  has  occurred,  and  to 
correct  for  it. 

An  important  question  is  whether  the  tasks  outlined  above 
(and  other  possible  tasks  too)  force  subjects  to  perceive  the 
stimuli  differently,  or  whether  each  subject  makes  use  of  a 
single  underlying  perceptual  structure  to  perform  all  tasks.  If 
the  latter  could  be  demonstrated,  it  would  allow  the  task  to  be 
selected  on  the  basis  of  convenience  for  any  given  application. 

7.2.3  Multidimensional  Scaling  and  Analysis 

It  should  be  clear  from  the  arguments  above  that  speech 
quality  can  vary  along  several  perceptual  dimensions 
simultaneously,  and  that  these  may  be  separable,  especially  if 
phoneme-specific sentences are used as test materials.
Furthermore,  diagnostic  information  about  different  aspects  of 
quality  can  be  derived  from  such  data.  Yet  most  approaches  to 
quality  assessment  start  with  the  assumption  that  quality  is  a 
unidimensional  variable,  thus  ignoring  the  diagnostic  potential. 
Unidimensional  testing  also  introduces  a major  source  of 
inter-subject  variability,  since  it  requires  the  subject  to 
collapse his multidimensional percepts onto a single dimension to
arrive at a response, and different subjects may weight the
various  perceptual  dimensions  quite  differently. 


A major  justification  for  treating  quality  as  a 
unidimensional  variable  is  that  it  will  be  used  for  choosing  the 
best  of  a set  of  candidate  systems,  an  inherently  one-dimensional 
task.  But  counterexamples  can  easily  be  found:  a vocoder  that 
yields  excellent  quality  for  female  voices,  but  fails 
disastrously  on  male  voices  will  not  receive  a high  quality 
rating  on  a unidimensional  scale.  Yet  it  may  be  ideally  suited 
for  an  application  in  which  only  females  will  use  it,  a fact  a 
unidimensional test would not discover. It would seem to be
better to recognize that quality is multidimensional, and collect
and analyze data accordingly, and only later collapse the
multidimensional result onto a unidimensional scale if desired.
Among other things, this would permit the tester to decide how to
combine the various perceptual dimensions, rather than be forced
to accept the idiosyncratic combinations adopted by the subjects
in a unidimensional task.


Multidimensional scaling (MDS) methods attempt to model
empirical  data  by  representing  each  stimulus,  or  vocoder  system, 
as  a point  in  an  n-dimensional  space,  such  that  the  data 
reconstructed  from  the  model  match  the  empirical  data  as  closely 
as  possible.  There  are  several  classes  of  models,  which  are 
hierarchically related in that each class is a special case of
the next-higher class in the hierarchy. The simplest is the
vector model. The stimuli (here the different vocoders) are
represented by points in an n-dimensional space, and each
condition under which data are collected (different sentences or
subjects) can be represented by a vector through the space. The
data are represented by the ordering and relative spacing of the
stimulus points as projected onto the appropriate vector. An
example of a vector model appropriate to scaling preference data
is MDPREF [16].
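
The geometry of the vector model can be illustrated with a small Python sketch. It shows only how projections onto a condition vector reproduce an ordering of the stimuli; it is not the MDPREF fitting procedure, and the coordinates, system names, and function names are hypothetical. A larger projection is assumed here to mean a more preferred system.

```python
import math


def project_onto_vector(points, direction):
    """Project each named stimulus point onto a unit vector along direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return {name: sum(c * u for c, u in zip(coords, unit))
            for name, coords in points.items()}


def preference_order(points, direction):
    """Order stimuli from largest to smallest projection."""
    scores = project_onto_vector(points, direction)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 2-dimensional stimulus space with three vocoder systems.
systems = {"A": (2.0, 0.0), "B": (0.0, 3.0), "C": (1.0, 1.0)}
print(preference_order(systems, (1.0, 0.0)))  # one sentence-talker condition
print(preference_order(systems, (0.0, 1.0)))  # a different condition
```

The same three points give different orderings under the two condition vectors, which is how a single stimulus space can account for data from many sentences or subjects.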

A second  type  of  model  appropriate  to  speech  quality 
assessment is the weighted Euclidean model, typified by INDSCAL
[17].  INDSCAL  was  developed  to  model  explicitly  the  large 
individual  differences  in  how  subjects  perceive  stimuli.  The 
model  assumes  that  all  subjects  use  the  same  set  of  underlying 
perceptual  dimensions,  but  that  the  relative  salience  of  these 
dimensions  varies  among  subjects.  Therefore,  INDSCAL  models  each 
stimulus  as  a point  in  a "group  space"  of  one  or  more  dimensions, 
which  represent  the  perceptual  dimensions  common  to  all  subjects. 
To  model  the  data  for  a particular  subject,  the  dimensions  of  the 
group  space  are  linearly  stretched  or  shrunk  until  they  reflect 
the  relative  salience  of  the  dimensions  for  that  subject.  Thus 
the  INDSCAL  solution  consists  of  sets  of  coordinates  for  the 
stimuli  in  the  group  space,  and  sets  of  weights  for  deforming  the 
group space to produce the idiosyncratic space for each subject.
The distance between stimulus points in the space represents
their  similarity:  stimuli  that  are  judged  very  similar  are 
represented  by  points  that  are  very  close  together  in  the  space. 
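
The weighted Euclidean distance underlying this kind of model can be written out in a few lines of Python. The sketch shows only the distance computation for one subject, not the INDSCAL fitting algorithm; the coordinates and weights are hypothetical.

```python
import math


def weighted_distance(x_i, x_j, weights):
    """Distance between two stimuli in one subject's stretched space.

    x_i, x_j: group-space coordinates of the two stimuli.
    weights:  that subject's non-negative salience weight per dimension.
    """
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x_i, x_j)))


p, q = (0.0, 0.0), (3.0, 4.0)
print(weighted_distance(p, q, (1.0, 1.0)))  # both dimensions equally salient
print(weighted_distance(p, q, (1.0, 0.0)))  # second dimension ignored
```

A subject who gives no weight to the second dimension perceives the two stimuli as closer together than a subject who weights both dimensions equally.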

The  multidimensional  space  that  is  used  to  model  the  data  in 
these  examples  is  a perceptual,  or  subjective  space.  The 
analysis  itself  does  not  identify  the  factors  represented  by  the 
coordinate  axes  of  the  space,  which  are  simply  those  that  give 
the  best  match  to  the  input  data.  Often,  but  not  always,  the 
axes  can  be  identified  from  the  way  the  stimuli  are  distributed 
in  the  space.  Otherwise,  several  additional  psychophysical 
experiments  may  be  required  to  identify  the  axes.  Even  this  does 
not  guarantee  that  the  axes  will  be  identified:  sometimes  no 
objective  properties  of  the  stimuli  can  be  found  that  correspond 
to particular subjective dimensions. Unfortunately, these
shortcomings reduce the usefulness of multidimensional scaling
for routine quality testing, although the method can be highly
beneficial in development work.

Preference data, such as those generated by the rating or
ranking  tasks  described  above,  can  be  analyzed  by  vector  models 
such  as  MDPREF,  or  by  weighted  Euclidean  models  such  as  INDSCAL 
if the data are first converted into proximities [17].

7.2.4 A Test of the Method Using 2600 bps Systems

To  try  out  our  method,  we  selected  a set  of  12  LPC  vocoder 
systems,  whose  bit  rates  were  equated  to  2600  bps  to  test  the 
method's  ability  to  discriminate  small  differences  of  quality. 
Each of the 36 test sentences (6 phoneme-specific sentences x 6
talkers)  was  low-pass  filtered  at  5 kHz  and  processed  through 
each  system.  Each  system  used  9,  11,  or  13  poles;  and 
inter-frame  intervals  (reciprocal  of  frame  rate)  were  25,  20,  or 
15  ms,  or  variable  based  on  data  analyzed  every  10  ms.  Details 
of  the  parameter  combinations  for  each  system  appear  under 
Fig.  7.1.  In  the  five  variable  rate  systems,  frames  of  spectral 
data  were  analyzed  every  10  ms,  but  each  frame  was  transmitted 
only  if  the  spectral  difference  between  it  and  the  previous 
transmitted  frame  exceeded  a threshold.  The  quantization  step 
size  of  the  LARs,  and,  for  the  VFR  systems,  the  threshold,  were 
adjusted  so  that  the  overall  bit  rate  of  all  systems  was  equated 
at  2600  bps,  averaged  over  the  36  test  sentences.  Quantization 
step  size  varied  between  0.2  dB  and  1.75  dB,  and  the  VFR 
thresholds  varied  between  1.0  and  1.75  dB,  yielding  average  frame 
rates  between  47  and  31  per  second.  Pitch  and  gain  were  coded  in 
6 and  5 bits  respectively,  and  were  transmitted  at  the  same  frame 
rate  as  the  coefficients  for  the  fixed-rate  systems,  but  at  a 
constant  rate  of  50  fps  for  the  VFR  systems,  to  avoid  confounding 
excitation and spectral variables. One final vocoder used 13
poles, with unquantized coefficients, and an inter-frame interval
of  10  ms.  Finally,  the  digitized  but  unprocessed  waveforms  were 
included  to  act  as  undegraded  anchors.  The  unprocessed  speech 
was  effectively  110  kbps  PCM. 
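
The variable-frame-rate rule described above (transmit a frame only when it differs from the last transmitted frame by more than a threshold) can be sketched in Python as follows. The spectral distance measure is not specified in this section, so it is passed in as a function; the function name, the scalar example "frames", and the decision to always send the first frame are assumptions for illustration.

```python
def select_frames(frames, distance, threshold):
    """Return the indices of the frames that would be transmitted.

    frames:    analysis frames in time order (e.g. one every 10 ms).
    distance:  function giving the spectral difference between two frames.
    threshold: a frame is sent only if it differs from the last sent frame
               by more than this amount.
    """
    if not frames:
        return []
    sent = [0]  # assumption: the first frame is always transmitted
    last = frames[0]
    for i in range(1, len(frames)):
        if distance(frames[i], last) > threshold:
            sent.append(i)
            last = frames[i]
    return sent


# Scalar stand-ins for spectral frames, with absolute difference as distance.
frames = [0.0, 0.5, 1.2, 1.3, 3.0]
print(select_frames(frames, lambda a, b: abs(a - b), threshold=1.0))
```

Note that each comparison is made against the last transmitted frame rather than the immediately preceding one, so slow drifts eventually trigger a transmission.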

The  same  four  subjects  served  in  two  judgment  tasks,  one  a 
ranking  task  and  the  other  a rating  task.  Our  purpose  in 
collecting  data  with  two  different  psychophysical  methods  was  to 
test  the  idea  that  any  judgments  required  of  a subject  are  made 
on  the  basis  of  a single  underlying  perceptual  structure,  or  set 
of  psychological  dimensions.  If  both  tasks  give  similar  results, 
this  idea  is  supported,  and  the  most  efficient  task  may  then  be 
selected  for  subsequent  experiments. 

In total, there were 504 stimulus sentences: 36 test
sentences x 14 systems (12 vocoders with quantized coefficients,
1 with unquantized coefficients, and 1 PCM). For the
rank-ordering  task,  these  were  transferred  to  Bell  and  Howell 
Language  Master  cards,  to  permit  random  access.  Each  subject 
rank  ordered  the  14  versions  of  a given  test  sentence,  separately 
for  each  of  the  36  sentence-speaker  combinations,  which  were 
arranged  in  a different  counterbalanced  order  for  each  subject. 
The  task  was  self-paced,  and  took  a total  of  6 to  9 hours,  spread 
over  several  days. 


For the rating task, the 504 stimulus sentences were
recorded into a carefully counterbalanced order, in which each
sentence, speaker, and system followed every other sentence,
speaker, and system (except itself) as nearly the same number of
times as possible. Stimuli were presented in blocks of 10; the
first stimulus in each block repeated the last stimulus in the
preceding block, and was not scored. Also, unbeknownst to the
subjects, the first 10 blocks of stimuli were repeated at the
end, thus permitting an assessment of consistency and drift.
Consistency was high and drift was negligible. The four subjects
had also served in the ranking task; they assigned "degradation
ratings" to each stimulus, with higher numbers representing more
degradation (lower quality). Subjects were told to assign zero
degradation to any undegraded sentences they heard, and to try to
assign ratings on a proportional basis, with twice as large a
number representing twice the degradation, as in a magnitude
estimation task with a natural zero. Since the first few
judgments from each subject effectively determined his step size,
each subject's ratings were later normalized by dividing through
by his mean rating. The rating task took just over an hour.
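
The normalization step can be illustrated with a short Python sketch. The ratings shown are hypothetical, and the function assumes that a subject's mean rating is non-zero.

```python
def normalize_ratings(ratings):
    """Divide one subject's degradation ratings by that subject's mean rating."""
    mean = sum(ratings) / len(ratings)  # assumed non-zero
    return [r / mean for r in ratings]


# Two hypothetical subjects who use different step sizes for the same percepts.
subject_a = [0, 1, 2, 3]
subject_b = [0, 10, 20, 30]
print(normalize_ratings(subject_a))
print(normalize_ratings(subject_b))
```

After normalization the two subjects' ratings coincide, since they differed only in the arbitrary step size each subject adopted, while the natural zero is preserved.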


Results 


The data from each task were pooled across subjects, and
analyzed separately with MDPREF [16]. The first three dimensions
accounted for 70.4%, 8.9%, and 6.0% respectively of the variance
in the rating data, and 65.8%, 11.9%, and 7.6% of the variance in
the rank data. In each case, the fourth dimension accounted for
only an additional 3% of the variance. Canonical correlation
[18] of the two solutions showed them to be almost identical,
with the first three (orthogonal) linear composites correlating
at 0.988, 0.930, and 0.758 respectively. The first two of these
are significant well beyond P<0.001 and the third is significant
at P<0.01 (chi-square = 69.6, with 9 df; 30.0, with 4 df; and
8.98, with 1 df). The conclusion that rating and ranking tasks
produced virtually identical results seems justified, which means
that the more efficient task (rating) can be used in future
assessments.


Figures 7.1 and 7.2 show the distribution of the vocoder
systems in the 3-dimensional solution space, as projected onto
[...] respectively. Each test sentence on which ratings were obtained
would  be  represented  by  a vector  through  the  space,  but  they  are 
not  shown,  to  avoid  cluttering  the  figure  (more  detail  can  be 
found  in  [19]).  The  relative  performance  of  two  vocoders  on  a 
particular  speaker-sentence  combination  is  represented  by  the 
relative  positions  of  the  projections  of  the  points  representing 
the  systems  onto  the  corresponding  vector. 


The  results  show  a clear  separation  of  the  systems  as  a 
function of 1) the number of poles, and 2) the inter-frame
interval of the vocoders. Furthermore, the separation along
these two dimensions was orthogonal, suggesting that the
perceptual effect of changing the number of poles ("static"
spectral accuracy) was independent of the perceptual effect of
changing the inter-frame interval ("dynamic" spectral accuracy).
The orientation of the test-sentence vectors in the space showed
that the separation of the fixed-rate systems by inter-frame
interval was achieved as a result of the specially composed
sentence materials, with the short inter-frame interval systems
performing better on the rapidly changing sentence (No. 4: Which
tea-party...), and the long inter-frame interval systems, with
more bits per frame, doing better on the slowly-varying sentences
(Nos. 1 and 2). The VFR systems were located correctly for their
inter-frame intervals of 10 ms, but performed unexpectedly badly
on the slow-moving sentences, Nos. 1 and 2. Separation of the
vocoders  as  a function  of  the  number  of  poles  resulted  from  the 
use  of  the  different  talkers,  with  the  relative  performance  of 
systems  with  13,  11,  and  9 poles  on  a particular  sentence  being 
highly  correlated  with  the  mean  fundamental  frequency  of  the 
talker  in  that  sentence.  Nine-pole  systems  performed  almost  as 
well  as  11-  or  13-pole  systems  on  high-pitched  talkers,  but  not 
on low-pitched talkers.

Conclusions

1) Phoneme-specific sentences, when used in a rating task to
assess the subjective quality of a set of twelve very similar
LPC vocoders, were able to distinguish quite small
differences in quality. The data were both reliable and
diagnostically useful, in that they permitted the particular
parameter causing degradation to be identified.

2) Virtually identical MDPREF solutions were obtained for rating
and rank-ordering tasks, which strongly supports the idea
that subjects used the same set of perceptual dimensions when
responding to vocoder-processed speech samples, for both of
these tasks. This means that the most cost-effective task
can be used exclusively, in this case the rating task.

7.3 Applications of the Method

7.3.1 Effects of Vocoder Parameters on Quality

A factorial subjective-quality study was performed to
measure how the quality of LPC vocoded speech is affected by
three different methods of reducing bit rate. A paper on this
study was presented at the 1977 ICASSP Conference at Hartford,
Conn., and is reproduced as Appendix 9. The three methods of
reducing bit rate were:

1) reducing the number of poles (P) used for spectral matching,

2) coarsening the quantization step size (Q) for the LAR
coefficients,

3) reducing the frame transmission rate (R).

The  set  of  spectral  parameter  values  that  were  used  are  shown 
below,  together  with  the  number  of  bits  per  frame. 


Quantization              No. of Poles, P
Step Size, Q         13     11      9      8

0.25 dB              76
0.5  dB              63     55     47     43
1.0  dB              50     44     38     35
2.0  dB              37     33     29     27

Bits  per  frame,  excluding  pitch  and  gain,  for  all  combinations  of 
number  of  poles  and  quantization  step  size  used  in  the  present 
study. 

Each combination of spectral parameters (except 13 poles with
0.25 dB quantization) was combined with four different fixed
transmission rates of R = 100, 67, 50, and 33 fps, yielding 48
LPC systems (4 x 3 x 4). Two additional systems were included: an
LPC system with 13 poles, quantization step size of 0.25 dB, and
transmission rate of 100 fps; and PCM speech at 110 kbps (i.e.,
the 5 kHz bandwidth speech sampled at 10 kHz and quantized to 11
bits), to act as an undegraded anchor. Pitch and gain were coded
in 6 and 5 bits respectively, and transmitted at the same frame
rate as the coefficients. The measured overall bit rates of the
LPC systems ranged from 8430 bps (P = 13, Q = 0.25 dB, R = 100
fps), down to 1225 bps (P = 8, Q = 2.0 dB, R = 33 fps), as shown
in Table 7.1. (These rates do not include the benefits of
Huffman coding.)
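
As a rough illustration of the arithmetic behind these figures, the sketch below computes a nominal bit rate from the bits-per-frame table, assuming that every transmitted frame carries the full set of LAR bits plus 6 pitch bits and 5 gain bits. The names `SPECTRAL_BITS`, `nominal_bps`, and `pcm_bps` are hypothetical and introduced only for this example. The nominal values it produces (8700 bps and 1254 bps for the two extreme systems) are somewhat higher than the measured rates quoted above (8430 bps and 1225 bps), so the sketch should be read as an upper-level approximation and not as the procedure used to obtain the measured rates.

```python
# Bits per frame for the LAR coefficients, keyed by
# (number of poles, quantization step size in dB); values from the table above.
SPECTRAL_BITS = {
    (13, 0.25): 76,
    (13, 0.5): 63, (11, 0.5): 55, (9, 0.5): 47, (8, 0.5): 43,
    (13, 1.0): 50, (11, 1.0): 44, (9, 1.0): 38, (8, 1.0): 35,
    (13, 2.0): 37, (11, 2.0): 33, (9, 2.0): 29, (8, 2.0): 27,
}
PITCH_BITS = 6
GAIN_BITS = 5


def nominal_bps(poles, step_db, frames_per_second):
    """Nominal bit rate, assuming every frame carries all LAR, pitch, and gain bits."""
    bits_per_frame = SPECTRAL_BITS[(poles, step_db)] + PITCH_BITS + GAIN_BITS
    return bits_per_frame * frames_per_second


def pcm_bps(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM speech."""
    return sample_rate_hz * bits_per_sample


print(nominal_bps(13, 0.25, 100))  # highest-rate LPC system in the study
print(nominal_bps(8, 2.0, 33))     # lowest-rate LPC system in the study
print(pcm_bps(10_000, 11))         # PCM anchor: 10 kHz sampling, 11 bits
```

The PCM figure reproduces the 110 kbps quoted for the undegraded anchor.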

Our  earlier  subjective  quality  tests  showed  the  necessity  of 
passing  all  sentence  materials  through  all  systems. 
Unfortunately,  we  could  not  use  all  36  speaker-sentence 
combinations  in  the  present  study,  since  passing  them  through  all 
50  vocoder  systems  would  have  made  the  study  unmanageably  large. 
We  therefore  selected  a subset  of  seven  speaker-sentence 
combinations,  and  confirmed  that  they  were  adequately 
representative  of  the  full  set  by  showing  that  the  MDPREF 
solution  obtained  from  the  data  from  the  subset  was  substantially 
the  same  as  that  obtained  from  the  complete  set.  (Canonical 
correlations  between  the  first  three  linear  composites  of  the  two 
solutions  were  0.991,  0.954,  and  0.923.)  The  selected  sentence 
tokens were: JB1, DD2, RS3, AR4, JB5, DK6, and RS6 (the initials
identify the speaker and the number identifies the sentence).
Average  fundamental  frequency  is  shown  for  each  test  sentence  in 
the  second  row  of  Table  7.1. 

ID      P   Q      R   kbps    Mean rating on each of seven sentences     Pooled
           (dB)  (fps)

...                                                           53.7  66.6   63.84
142    13   2.0    67   3.288   71.8  63.5  78.5  61.2  55.6  59.3  59.7   62.97
143    13   2.0    50   2.468   71.8  68.2  72.8  68.8  47.1  43.8  64.6   68.46
144    13   2.0    33   1.638   59.9  64.1  75.1  78.6  52.4  44.2  66.8   61.78
221    11   0.5   100   6.438   56.3  58.4  54.9  58.6  29.6  37.7  54.8   47.65
222    11   0.5    67   4.288   51.8  48.3  67.2  57.6  39.9  38.2  52.1   49.58
223    11   0.5    50   3.218   55.8  52.4  68.9  63.1  38.6  34.9  59.4   53.19
224    11   0.5    33   2.138   49.8  55.1  73.4  65.8  48.4  41.8  66.8   56.86
231    11   1.0   100   5.368   68.5  48.9  68.9  53.8  35.4  32.5  55.2   49.68
232    11   1.0    67   3.567   52.2  53.9  62.8  56.6  43.7  42.5  53.8   51.97
233    11   1.0    50   2.676   51.4  49.2  72.1  62.3  41.7  48.2  55.8   54.27
234    11   1.0    33   1.781   53.5  56.6  72.8  69.8  58.1  41.8  63.7   58.21
241    11   2.0   100   4.528   71.7  63.9  62.4  59.4  49.2  48.6  63.8   59.75
242    11   2.0    67   2.968   69.9  59.7  71.9  62.9  49.9  53.2  59.4   68.99
243    11   2.0    50   2.226   68.2  58.8  69.3  61.7  44.4  42.1  63.1   58.14
244    11   2.0    33   1.481   67.5  67.9  74.4  69.2  68.1  44.3  78.9   64.89
321     9   0.5   100   5.638   66.8  58.8  58.5  53.7  46.4  57.8  52.5   56.24
322     9   0.5    67   3.747   68.4  53.9  68.4  62.3  59.1  57.6  58.3   68.81
323     9   0.5    50   2.818   67.8  57.8  74.1  61.1  52.6  64.4  56.5   61.82
324     9   0.5    33   1.871   78.3  64.6  75.8  78.9  66.4  69.9  65.7   68.95
331     9   1.0   100   4.768   72.8  61.4  59.5  57.8  51.1  57.9  58.5   59.75
332     9   1.0    67   3.168   61.7  59.6  66.4  68.6  52.9  61.5  54.7   59.63
333     9   1.0    50   2.376   74.5  62.2  69.4  59.8  56.5  59.7  61.2   63.22
334     9   1.0    33   1.581   69.9  68.8  76.2  68.9  69.6  69.6  73.9   78.97
341     9   2.0   100   3.968   76.1  73.4  67.6  68.7  57.2  63.6  68.3   65.56
342     9   2.0    67   2.634   75.4  72.1  78.8  67.8  56.6  64.8  61.1   66.72
343     9   2.0    50   1.976   72.1  74.7  72.7  69.9  57.8  63.4  63.5   67.62
344     9   2.0    33   1.315   71.4  75.3  74.3  68.1  71.6  64.4  78.6   78.83
421     8   0.5   100   5.168   79.8  59.9  56.9  56.7  63.9  76.2  54.6   63.86
422     8   0.5    67   3.434   88.4  68.7  64.4  62.7  66.6  75.7  55.7   67.76
423     8   0.5    50   2.534   79.3  65.9  71.8  63.4  68.4  74.4  62.7   69.41
424     8   0.5    33   1.715   81.6  69.9  78.8  71.7  74.7  76.9  66.9   73.87
431     8   1.0   100   4.461   77.9  63.5  61.1  56.2  63.9  69.4  59.4   64.48
432     8   1.0    67   2.968   76.6  67.8  68.3  63.8  66.8  78.8  53.8   67.52
433     8   1.0    50   2.226   76.8  61.7  69.9  62.7  65.6  76.9  59.8   67.38
434     8   1.0    33   1.481   88.8  72.9  76.2  78.7  75.6  77.4  71.7   74.92
441     8   2.0   100   3.691   81.4  64.8  69.2  66.9  67.8  75.6  64.7   69.85
442     8   2.0    67   2.456   88.4  72.5  71.7  68.9  65.6  77.9  68.7   71.18
443     8   2.0    50   1.848   78.2  66.9  74.8  68.9  68.6  77.8  71.8   72.19
444     8   2.0    33   1.225   78.8  71.1  76.9  71.9  79.5  82.6  69.4   75.63

Table 7.1  System ID's and parameters, together with mean
degradation rating on each of the seven test sentences
(see heading), and all seven sentences pooled.
(See text for more details.)

Each of the seven input sentences was low-passed at 5 kHz,
digitized (11 bits, 10 kHz), and passed through each of the 50
simulated vocoder systems, to yield a total of 350 different
stimulus items. A counterbalanced presentation sequence was
generated, in which each of the 50 systems followed every other
system once, and each speaker and sentence followed each other
speaker or other sentence with about the same frequency. No
system and no sentence followed itself.
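
The report does not say how the counterbalanced sequence was constructed. One standard way to obtain a sequence in which every item follows every other item exactly once, and no item follows itself, is to walk an Eulerian circuit of the complete directed graph on the items. The sketch below does this with Hierholzer's algorithm; it is offered only as an illustration of the system-ordering constraint, the function name `counterbalanced_sequence` is hypothetical, and the additional balancing of speakers and sentences described above is not attempted.

```python
def counterbalanced_sequence(items):
    """Return a sequence in which every ordered pair of distinct items
    occurs exactly once as (predecessor, successor).

    The sequence is an Eulerian circuit of the complete directed graph
    without self-loops, found with Hierholzer's algorithm.
    """
    items = list(items)
    if not items:
        return []
    # Unused outgoing edges: from each item to every other item.
    unused = {a: [b for b in items if b != a] for a in items}
    stack = [items[0]]
    circuit = []
    while stack:
        node = stack[-1]
        if unused[node]:
            # Follow an unused edge out of the current node.
            stack.append(unused[node].pop())
        else:
            # No edges left here: this node is final in the circuit.
            circuit.append(stack.pop())
    circuit.reverse()
    return circuit


sequence = counterbalanced_sequence(["A", "B", "C", "D"])
print(len(sequence))  # n * (n - 1) + 1 entries for n items
```

For n items the sequence has n(n - 1) + 1 entries, so 50 systems would require 2451 presentations to realize every ordered pair once.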


In addition to counterbalancing the presentation sequence,
we tried to further reduce sequence effects, and thus improve the
reliability of the data, by fading in and out a continuous speech
babble at the same level as the speech, during each
inter-stimulus interval. (This method is described further
below, in Section 7.4.4.) Seven experimental tapes were
recorded. Stimuli were presented in blocks of ten, at a rate of
one every 7.5 seconds, with a longer gap between blocks. The
subject's task was to rate the degradation of the stimuli he
heard. This negative attribute was chosen for scaling, because
the  scale  has  a natural  origin,  or  zero,  corresponding  to 
undegraded  speech.  Degradation  ratings  ranged  between  0 and  100, 
with  small  numbers  corresponding  to  high  quality,  and  large 
numbers  to  poor  quality.  Nine  normal  hearing  subjects  served  in 
the  experiment.  All  of  the  subjects  made  the  first  two  passes 
through  the  350  stimuli,  and  three  of  them  made  a further  three 
passes  each. 

Results 

First,  to  check  on  the  reliability  of  the  data,  the 
responses  collected  on  each  pair  of  passes  through  the  350 
stimuli were correlated, for each subject. All correlations were
significant, all but three well beyond p<.001. Therefore,
although there was some variability between subjects, all the
subjects apparently gave highly reliable data.
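
The reliability check described above can be sketched as a correlation between one subject's ratings on two passes through the same stimuli. The report does not name the coefficient used; the sketch assumes the ordinary Pearson product-moment correlation, and the function name `pearson_r` and the sample ratings are hypothetical.

```python
import math


def pearson_r(first_pass, second_pass):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(first_pass)
    if n != len(second_pass) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")
    mean_x = sum(first_pass) / n
    mean_y = sum(second_pass) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in first_pass)
    syy = sum((y - mean_y) ** 2 for y in second_pass)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(first_pass, second_pass))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical degradation ratings (0-100) for five stimuli on two passes.
pass_1 = [20, 35, 50, 65, 80]
pass_2 = [25, 30, 55, 60, 85]
print(round(pearson_r(pass_1, pass_2), 3))
```

In the study itself each pass would contribute 350 ratings per subject, one for each stimulus item.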

The  mean  degradation  rating  was  calculated  for  each  system, 
both  by  sentence,  and  pooled  across  all  seven  sentences.  The 
mean  ratings  are  shown  in  Table  7.1,  and  the  pooled  means  are 
plotted in Fig. 7.3. Each system is identified by three digits,
corresponding to its parameter level for P, Q, and R,
respectively. Thus system 231 used level 2 of P (11 poles),
level 3 of Q (1.0 dB) and level 1 of R (100 fps), as shown in the
key to the figure. The 110 kbps PCM speech, used as undegraded
anchor, is labelled "000." The mean ratings (N.B. not the raw
ratings) have standard deviations ranging between 1.0 and 1.7
degradation points. Any difference between two plotted means
that is larger than about 4-5 points is likely to be significant
at P<0.05, and some much smaller differences were also
significant. (The results of t-tests between each pair of
systems are described in [19].)
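
The three-digit system identifiers can be decoded mechanically. The sketch below assumes that the levels of each parameter are numbered in the order in which the values are listed in the text (13, 11, 9, 8 poles; 0.25, 0.5, 1.0, 2.0 dB; 100, 67, 50, 33 fps), which agrees with the example of system 231 given above. The function name `decode_system_id` and the lookup tables are hypothetical.

```python
# Level-to-value maps; the level numbering is inferred from the text.
POLES_BY_LEVEL = {1: 13, 2: 11, 3: 9, 4: 8}
STEP_DB_BY_LEVEL = {1: 0.25, 2: 0.5, 3: 1.0, 4: 2.0}
FPS_BY_LEVEL = {1: 100, 2: 67, 3: 50, 4: 33}


def decode_system_id(system_id):
    """Return (poles, step size in dB, frames per second) for a system ID.

    Returns None for "000", the label of the PCM anchor.
    """
    if system_id == "000":
        return None
    if len(system_id) != 3 or not system_id.isdigit():
        raise ValueError(f"not a three-digit system ID: {system_id!r}")
    p_level, q_level, r_level = (int(digit) for digit in system_id)
    try:
        return (POLES_BY_LEVEL[p_level],
                STEP_DB_BY_LEVEL[q_level],
                FPS_BY_LEVEL[r_level])
    except KeyError:
        raise ValueError(f"unknown parameter level in {system_id!r}") from None


print(decode_system_id("231"))
```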

Fig. 7.3 shows the effects on degradation of decreasing bit
rate by: a) reducing the number of poles (top); b) coarsening
the quantization step size (middle); and c) decreasing the frame
rate (bottom). In each case, the two remaining parameters are
held constant: each line represents a family of vocoders that
differ in only one parameter. Comparing the slopes of the lines
in the three parts of the figure shows dramatically that reducing
the frame rate (Fig. 7.3, bottom) yields the largest savings of
bit rate for the smallest loss of quality, and that for many of
the systems the loss of quality shows no knee, even at the lowest
frame rate. The flatness of these lines justifies our enthusiasm
for variable frame rate systems, whose superiority we document
further in later sections.

Fig. 7.3  Degradation rating vs. overall bit rate for 49 vocoders.
The effect on degradation of changing the number of
poles (top panel); the quantization step size (middle
panel); and the frame rate (bottom panel). See text
for details.

Secondly,  inspection  of  Fig.  7.3,  top,  shows  that  the  rate 
of  quality  loss  per  bit  saved  is  most  severe  for  savings  gained 
by  reducing  the  number  of  poles.  There  is  a sharp  knee  in  most 
of  the  functions  at  11  poles  — it  is  unfortunate  that  we  did  not 
also  include  10  poles,  although  our  other  work  suggests  that  11 
poles  is  in  fact  the  lowest  number  that  yields  good  quality  with 
male  voices,  with  a 5 kHz  speech  bandwidth. 

7.3.2  Speech  Quality  Testing  of  Some  VFR  Vocoders 

VFR  transmission  of  LPC  vocoder  coefficients  is  a technique 
for  reducing  the  average  transmission  rate  without  appreciable 
loss  of  quality  (see  Section  4).  The  technique  transmits 
parameters  at  a variable  rate  in  accordance  with  the  changing 
characteristics  of  the  speech  signal.  To  demonstrate  the 
soundness  of  the  rationale  for  VFR  transmission,  an  experiment 
was  performed  to  compare  VFR  with  two  other  methods  for  reducing 
the bit rate: (a) reducing the number of poles, and
(b) increasing the quantization step size of the LAR
coefficients. The VFR scheme tested used a transmission decision
based  on  the  log  likelihood  ratio,  with  a single  threshold,  as 
described  in  Section  4.1.  (Our  superior  perceptual-model  based 
scheme,  whose  testing  is  described  in  Section  7.3.3,  was  a later 
development.)  Thirty-two  stimulus  sentences  were  prepared  by 
passing  four  utterances  (2  sentences  x 2 speakers)  through  eight 
vocoder  systems.  The  vocoders  were  specified  by  a 2 x 2 x 2 
factorial  design;  two  values  were  assigned  to  each  of  the  three 
parameters:  average  frame  rate,  number  of  poles,  and 
quantization  step  size.  Eight  listeners  made  7-point  category 
ratings  of  quality  degradation.  The  results  of  the  experiment 
show  that,  of  the  three  methods  studied,  the  VFR  technique 
produced  the  highest  quality  at  any  given  transmission  rate  (or, 
equivalently,  yielded  the  lowest  bit  rate  for  a fixed  level  of 
speech  quality).  The  results  of  this  study  have  been  published, 
and  the  published  paper  is  reproduced  as  Appendix  10. 
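
The 2 x 2 x 2 factorial design can be enumerated directly. The sketch below uses the parameter values given in the description of the study (11 or 8 poles, 0.5 or 2.0 dB step size, and a log likelihood ratio threshold of 0 or 2.5 dB); the names `factorial_design` and the dictionary keys are hypothetical, and no attempt is made to reproduce the letter labels used for the systems in the figure.

```python
from itertools import product

POLES = (11, 8)
STEP_SIZES_DB = (0.5, 2.0)
THRESHOLDS_DB = (0.0, 2.5)  # 0 dB gives a fixed rate; 2.5 dB gives a variable rate
UTTERANCES = 4              # 2 sentences x 2 speakers


def factorial_design():
    """List every combination of the two levels of the three parameters."""
    return [
        {"poles": poles, "step_db": step_db, "threshold_db": threshold_db}
        for poles, step_db, threshold_db
        in product(POLES, STEP_SIZES_DB, THRESHOLDS_DB)
    ]


systems = factorial_design()
stimuli = len(systems) * UTTERANCES
print(len(systems), stimuli)
```

Eight systems applied to four utterances account for the thirty-two stimulus sentences.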

The  present  study  had  the  explicit  aim  of  comparing  systems 
that  differed  along  three  dimensions.  We  adopted  a factorial 
design,  in  which  two  values  of  each  of  the  three  parameters 
occurred  in  every  possible  combination.  The  resulting  systems 
produce  a wide  range  of  qualities.  Each  system  used  either  11  or 
8 poles.  The  LAR  coefficients  were  quantized  in  steps  of  either 
0.5  dB  or  2.0  dB.  LPC  analysis  of  the  speech  signal  was  carried 
out  at  50  fps,  and  the  log  likelihood  ratio  threshold  of  the  VFR 
scheme was set to either zero dB, in which case every analyzed
frame  was  transmitted,  yielding  a fixed  frame  rate  of  50  per 
second,  or  2.5  dB,  which  resulted  in  a variable  frame  rate  that 
averaged  23.3  per  second.  Note  that  2.5  dB  represents  a very 
coarse  threshold,  and  that  the  resulting  average  frame  rate  is 
less  than  60%  of  the  average  frame  rate  of  the  VFR  systems  in  the 
study  reported  above  (Section  7.2).  Pitch  and  gain  were  coded  in 
6 and  5 bits  respectively,  and  transmitted  at  a constant  rate  of 
50  fps  for  all  8 vocoder  systems. 
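
The effect of the threshold on the transmitted frame rate can be illustrated with a generic single-threshold decision rule. This is only a sketch under stated assumptions: the log likelihood ratio measure is defined in Section 4.1 and is not reproduced here, so `distance` is a hypothetical stand-in; comparing each frame against the most recently transmitted frame, and transmitting when the distance is greater than or equal to the threshold, are likewise assumptions made for the example.

```python
def select_frames(frames, threshold, distance):
    """Return the indices of the frames chosen for transmission.

    A frame is transmitted when its distance from the most recently
    transmitted frame is at least `threshold`. The first frame is
    always transmitted.
    """
    transmitted = []
    reference = None
    for index, frame in enumerate(frames):
        if reference is None or distance(reference, frame) >= threshold:
            transmitted.append(index)
            reference = frame
    return transmitted


def toy_distance(a, b):
    """Placeholder distance between two scalar 'frames' (not the LLR)."""
    return abs(a - b)


frames = [0.0, 0.1, 0.2, 3.0, 3.1]
print(select_frames(frames, 0.0, toy_distance))  # zero threshold
print(select_frames(frames, 1.0, toy_distance))  # coarser threshold
```

With a zero threshold the rule selects every analyzed frame, which corresponds to the fixed-rate case described above; raising the threshold skips frames during slowly varying stretches.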

A subset  of  the  thirty-six  test  sentences  used  in  the  first 
study  was  selected.  To  ensure  that  the  subset  was  representative 
of  the  whole  set  of  36,  we  chose  the  two  "general"  sentences 
(i.e.  Nos  5 and  6),  since  between  them  these  contain  most  of  the 
English  phonemes.  Two  speakers  were  then  selected,  one  male  and 
one  female,  such  that  the  vectors  corresponding  to  their 
productions  of  the  two  general  sentences  were  separated  as  widely 
as possible in the MDPREF solution space of the earlier study.
To confirm that these four stimulus sentences were adequately
representative, we repeated the MDPREF analysis of the earlier
study, using only the subset of data collected on the four
sentences. The solution obtained was similar to the solution
obtained with the whole set of 36 sentences, and achieved the
same orthogonal separation of the systems by number of poles, and
by frame rate. (Canonical correlations between the first three
linear composites for the two solutions were 0.978, 0.915, and
0.428.) This test confirmed that the selected subset was indeed
representative.

The  four  sentences  were  passed  through  the  eight  simulated 
vocoders,  and  were  recorded  in  two  random  orders  on  the  stimulus 
tape,  with  order  of  sequential  presentation  counterbalanced  fully 
across  system  pairs,  and  as  far  as  possible  across  sentence 
pairs,  with  the  constraint  that  no  system  and  no  sentence  should 
follow itself. Eight subjects were then run individually through
two exact repetitions of the tape, although the subjects were
not aware of the repetition. Thus each subject made four ratings
on each of the 32 stimulus sentences. They rated the degradation
of what they heard on a seven-point scale, 1-7, with "overflow
bins" (0 and 8) at each end. That is, if a stimulus sounded
appreciably  better  than  a previous  one  labelled  with  a "1",  the 
subject  was  allowed  to  use  a "0"  response. 

Results 

The  mean  ratings  assigned  to  the  eight  systems  are  shown  in 
Fig.  7.4,  where  the  ratings  are  plotted  against  overall  bit  rate 
including  pitch  and  gain.  Lines  join  each  pair  of  systems  that 
differ  in  only  a single  parameter:  solid  lines  join  all  pairs  of 
systems  that  differ  only  in  frame  rate;  dashed  lines  join  pairs 
of  systems  that  differ  only  in  the  number  of  poles;  and  dotted 
lines join pairs that differ only in quantization step size.

Consider first the three lines leaving System A, at the
upper right hand corner of the figure. For each parameter,
System  A has  the  parameter  value  associated  with  better  speech 
quality.  Bit  rate  can  be  reduced  for  this  system  in  three  ways: 
1)  by  reducing  the  number  of  poles,  2)  by  coarsening  the 
quantization,  or  3)  by  going  to  a VFR  transmission  schedule.  The 
figure  shows  that  reducing  the  number  of  poles  resulted  in  the 
smallest  savings  in  bits,  accompanied  by  a large  loss  of  quality. 
Increasing  the  quantization  step  size  yielded  a slightly  better 
rate  of  bits-saved  per  unit  quality-loss.  Both  the  largest 
savings  in  bits  and  the  smallest  drop  in  quality  were  associated 
with  the  introduction  of  the  VFR  scheme.  Similar  conclusions  can 
be  drawn  from  looking  at  the  gains  in  quality  achieved  by 
increasing the bit rate of the worst system, System H, at the
bottom  left  of  the  figure.  The  smallest  quality  improvement, 
with  the  largest  cost  in  extra  bits,  was  obtained  by  abandoning 
the  VFR  scheme. 

For  one  pair  of  otherwise  identical  systems,  going  from 
fixed  to  variable  frame  rate  reduced  the  bit  rate  by  about  40% 
with  no  effect  on  quality  (see  Systems  C and  D in  Fig.  7.4).  All 
but  three  of  the  quality  differences,  between  pairs  of  systems 
joined  by  lines,  are  extremely  significant  — that  is,  well 
beyond  the  P<.001  level.  The  three  exceptions  were  1)  the 
quality  difference  between  Systems  C and  D,  which  was  not 
significant; 2) the difference between Systems G and H, which
just failed to reach significance at the P<.05 level; and 3) the
difference between F and H, which was just significant (P<.05).

There  was  a strong  interaction  between  the  speaker  and  the 
effect  of  number  of  poles.  The  male  speaker's  speech  was 
severely  degraded  by  the  8-pole  systems,  whereas  the  female 
speaker's  speech  was  little  affected.  In  fact,  for  the  female 
speaker,  reducing  the  number  of  poles  yielded  a rate  of 
quality-decline  per  bit-saved  no  greater  than  that  obtained  by 
adopting  VFR  transmission.  The  relative  speech  quality  of 
systems  using  13,  11,  and  9 poles  on  a particular  sentence  was 
highly  dependent  on  the  mean  fundamental  frequency  in  the  test 
sentence.  It  is  likely  that  the  critical  variable  is  not  the 
fundamental  frequency,  but  rather  the  length  of  the  speaker's 
vocal tract, which tends to correlate highly with fundamental
frequency (large men have low voices).

Conclusions 

Our  results  confirm  that  VFR  transmission  can  yield 
substantial  savings  in  bit  rate,  with  only  minor  loss  of  quality. 
The  rate  of  bits  saved,  per  unit  quality  loss,  is  highest  for 
savings  achieved  by  VFR  transmission,  and  lowest  for  those 
achieved  by  reducing  the  number  of  poles  used  in  spectral 
modelling  — at  least  for  the  parameter  values  studied  here. 
Secondly,  there  are  major  interactions  between  perceived  speech 
quality and the fundamental frequency of the talker, for some
systems.


7.3.3 Quality Testing of a Perceptual-Model-Based VFR System

Subjects  judged  the  degradation  of  quality  caused  by 
processing  speech  through  six  LPC  vocoder  systems.  Two  of  these 
systems  were  versions  of  our  new  VFR  system,  based  on  a 
perceptual  model  (PM)  of  speech  (cf.  Section  4.2).  The  third  was 
our  earlier  log-likelihood-ratio  VFR  system,  and  the  remaining 
three  were  fixed-rate  systems,  one  with  a frame  rate  of  33  fps, 
roughly equal to the average frame rate of the PM systems,
another with a frame rate of 100 fps, equal to the peak rate of
the PM systems, and a third that had an intermediate rate of 50
fps. Stimulus materials were the six phoneme-specific sentences
read  by  each  of  six  speakers,  three  male  and  three  female,  as 
described  in  Section  7.2.1.  The  results  show  that  the  quality  of 
the  PM  systems  equalled  or  surpassed  that  of  the  100  fps 
fixed-rate  system,  at  about  one  third  of  the  bit  rate. 

Since  we  have  demonstrated  the  correctness  of  the  rationale 
underlying  VFR  transmission  (Section  7.3.2),  the  next  question  to 
address  is  whether  a better  strategy  can  be  developed  for 
deciding  which  frames  of  speech  data  should  be  transmitted.  Such 
an  improved  strategy,  based  on  a perceptual  model  of  speech,  was 
described above in Section 4.2. The purpose of the present study
was to make a formal comparison of subjective speech quality
between (1) our improved VFR scheme (2 versions), (2) our earlier
log-likelihood-ratio VFR scheme, and (3) three related fixed-rate
systems. All six systems included in the test transmitted
different subsets of the spectral, pitch and gain data which
resulted from analyzing the input speech at a rate of 100 fps,
using an 11th order predictor. Each frame of spectral data was
coded in 46 bits: 6 bits were allocated to the first LAR; 5 bits
each to the second and third; 4 bits each to the fourth through
ninth; and 3 bits each to the tenth and eleventh LARs. Pitch and
gain were coded in 6 and 5 bits respectively. Average bit-rate
and frame-rate data for each of the six systems included in the
test are shown below.


                        Frames per second
I.D.        BPS      LARs    Pitch    Gain

Fixed Rate:
F100        5700      100      100     100
F50         2850       50       50      50
F33         1900       33       33      33

Variable Rate:
VFR-1       2320       36       34      28
PML         1880       27       34      40
PMH         2120       31       34      40

Table 7.2  Overall bit rates, and frame rates for Coefficients
(LARs), Pitch, and Gain, for the three fixed-rate and
three VFR systems tested.

The first fixed rate system, labelled F100, transmitted at 100
fps; that is, every frame of data analyzed was also transmitted.
The overall bit rate of the F100 system was 5700
bps (46 spectral bits + 11 pitch-and-gain bits, 100 times per
second). The other two fixed rate systems, labelled F50 and F33,
transmitted every second and every third frame of the data
analyzed at 100 fps, respectively. The F50 system was included
because its bit rate and quality are comparable to those of
LPC-I, specified for the ARPANET. However, F50 differs from
LPC-I (a) in signal sampling rate (10 vs. 6.7 kHz); (b) in bits
per frame (46 vs. 56); and (c) in the pitch extraction scheme.
The third fixed-rate system, F33, was included to demonstrate the
substantial degradation of quality associated with a simple
fixed-rate system transmitting at about the same average bit rate
as the VFR systems.
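
The bit allocation and the fixed-rate bit rates quoted above can be checked with a few lines of arithmetic. In the sketch below, the names `LAR_BITS` and `fixed_rate_bps` are hypothetical; exact fractions are used so that keeping every third frame of a 100 fps analysis (33 1/3 fps) gives a whole-number result. The variable-rate systems are not covered, since their overall rates in Table 7.2 depend on details of the transmission schemes that are not given in this section.

```python
from fractions import Fraction

# Bits allocated to LARs 1 through 11, as described in the text.
LAR_BITS = [6, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3]
PITCH_BITS = 6
GAIN_BITS = 5
ANALYSIS_RATE_FPS = 100


def fixed_rate_bps(keep_every):
    """Bit rate when every `keep_every`-th analyzed frame is transmitted."""
    bits_per_frame = sum(LAR_BITS) + PITCH_BITS + GAIN_BITS
    return Fraction(bits_per_frame * ANALYSIS_RATE_FPS, keep_every)


print(sum(LAR_BITS))      # spectral bits per frame
print(fixed_rate_bps(1))  # F100
print(fixed_rate_bps(2))  # F50
print(fixed_rate_bps(3))  # F33
```

These values agree with the BPS column of Table 7.2 for the three fixed-rate systems.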

The  three  VFR  systems  represent  two  different  transmission 
strategies,  one  using  a log-likelihood  ratio  decision,  and  the 
other  two  a perceptual-model  based  decision.  The  latter  two 
systems  differ  only  in  the  thresholds  for  determining  which 
frames  of  spectral  data  should  be  transmitted. 

The  log-likelihood  ratio  system,  labelled  VFR-1,  selected 
frames of LARs (analyzed at 100 fps, as for the fixed-rate
systems)  using  our  earlier  single-threshold  log-likelihood  ratio 
scheme,  with  the  threshold  set  at  1.5  dB.  Pitch  and  gain  data 
(coded  in  6 and  5 bits,  as  above)  were  selected  for  transmission 
using  our  double-threshold  PAP  scheme  (cf.  Section  4.3.1), 
applied  to  the  quantized  values.  Threshold  values  were  0 and  1 
quantized steps for pitch, and 2 and 3 quantized steps for gain.

The two perceptual-model based systems (PMH and PML)
selected spectral frames for transmission, from the same data
analyzed at 100 fps, using the simplified VFR scheme described in
detail in Section 4.2.8. The system labelled PML (Lower rate)
used a threshold of 1.3, whereas PMH (Higher rate) used a
threshold of 1.0. Both systems transmitted exactly the same
pitch and gain data. The quantized pitch data were selected for
transmission by the single-threshold PIT scheme (Section 4.3.2),
with  a threshold  of  0 steps,  and  the  quantized  gain  data  were 
selected  by  the  double-threshold  FIT  scheme,  with  thresholds  of  0 
and  1 steps. 

The  speech  materials  consisted  of  36  test  sentences:  the 
set of six phoneme-specific sentences read by the six speakers
described  above.  Each  of  the  36  test  sentences  was  processed  by 
each  of  the  6 LPC  systems,  yielding  a total  of  216  stimulus 
sentences.  These  were  recorded  on  tape  in  two  separate  orders, 
each  counterbalanced  so  that  each  speaker  followed  each  other 
speaker  an  equal  number  of  times,  and  similarly  for  the  sentences 
and  systems. 

After  some  preliminary  practice,  subjects  rated  the 
subjective  quality  of  each  of  the  216  stimulus  sentences  on  an 
8-point category scale, with "overflow bins" of 0 and 9. Each
tape was rated in a separate 25-minute session. The
subjects were instructed to use the full range of the rating
scale, assigning 8's to the "best" quality stimuli, and 1's to
the "worst." The overflow bins were to be used only when an
extreme rating (1 or 8) had been assigned to the previous
stimulus, and the following stimulus seemed to be even more
extreme.  The  five  subjects  who  served  were  all  highly  familiar 
with  vocoded  speech. 

In Fig. 7.5, the mean ratings across all speakers,
sentences, subjects, and replications (the 2 sessions for each
subject), are plotted against mean overall bit rate, for each of
the six systems. The points representing the three fixed rate
systems are joined by one line, and those for the three VFR
systems by a second line. For the fixed rate systems, reducing
the frame rate from 100 fps to 50 fps resulted in a slight gain
in quality, but further reducing it to 33 fps produced a major
loss of quality. The three VFR systems apparently produced quite
similar quality, roughly equivalent to the F100 and F50 systems,
but at a bit rate comparable to the F33 system.

Fig. 7.5  Mean quality rating vs. overall bit rate for 3 fixed-rate
and 3 VFR systems, all transmitting frames from the same
data base. See text for details.

T-tests showed that several of the apparently quite small
differences in quality were highly reliable. The ratings were
converted to differences for the purpose of the t-tests: the
variate tested was the difference in rating assigned to the two
systems being compared, for the same sentence by the same speaker
judged by the same subject in the same session. If there was no
difference between two systems, this variate should have zero
mean. The perceptual model system with the higher frame rate
(PMH) yielded better quality than any other system. PMH's
quality advantage was highly significant (P<.001) over all other
systems except F50; its advantage over F50 just failed to achieve
significance at the 5% level by two tailed test. The F50 system,
in turn, was significantly better than VFR-1 (P<0.05), but was
not significantly better than the PML system, even though F50 had
a bit rate 50% greater than PML.
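
The difference-based test described above corresponds to a paired t-test. The sketch below computes the t statistic and its degrees of freedom from matched ratings of two systems; it assumes the usual one-sample t statistic on the differences, and the function name `paired_t` and the sample ratings are hypothetical. Converting the statistic to a probability requires a t table or a statistics library, which is left out here.

```python
import math


def paired_t(ratings_a, ratings_b):
    """Return (t, degrees of freedom) for matched ratings of two systems.

    Each position must hold ratings of the same sentence, by the same
    speaker, judged by the same subject in the same session.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the differences.
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when all differences are equal")
    std_error = math.sqrt(variance / n)
    return mean_diff / std_error, n - 1


# Hypothetical matched quality ratings (8-point scale) for two systems.
system_a = [7, 6, 8, 7, 6]
system_b = [5, 5, 6, 6, 5]
t_value, dof = paired_t(system_a, system_b)
print(round(t_value, 3), dof)
```

Under the hypothesis of no difference between the two systems, the mean of the differences is zero and the statistic follows a t distribution with n - 1 degrees of freedom.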

The  average  ratings  conceal  some  highly  interesting 
differences  between  the  systems,  which  are  illustrated  in 
Fig.  7.6.  In  this  figure,  the  results  are  plotted  separately  for 
each  sentence.  The  points  representing  the  six  sentences  for 
each  fixed  rate  system  are  joined  by  a vertical  line,  since  bit 
rate  did  not  vary  with  sentence  for  these  systems.  For  the 
variable  rate  systems  (PMH,  PML,  and  VFR-1),  the  lines  joining 
the  six  sentences  for  each  system  are  not  vertical,  since  bit 
rate  varied  from  sentence  to  sentence,  as  well  as  quality.  Each 
data  point  consists  of  a digit,  keyed  in  the  caption,  which 
identifies  the  sentence  used. 

When processed by the PMH system, each of the six sentences
obtained an above-average rating, and the PMH system was never
significantly outperformed by any other system, on any sentence.
On the other hand, t-tests showed that PMH significantly
outperformed F100 on three sentences (Nos. 1, 2, and 5); F50 on
two sentences (Nos. 3 and 4); F33 on all sentences; VFR-1 on one
sentence (No. 2); and PML on three sentences (Nos. 1, 2, and 6).
PML at its best performed as well as PMH, particularly on the
fast-moving sentence (No. 4). However, four sentences were rated
at or below the overall mean for PML (Nos. 1, 2, 3, and 6).

Fig. 7.6  Mean quality rating vs. overall bit rate for 3 fixed-rate
(F33, F50, F100) and 3 VFR (VFR-1, PMH, PML) systems. Each of
the six digits joined by a line to represent a system's
performance corresponds to the system's performance on a
particular test sentence, as follows:

1.  Why were you away a year, Roy?
2.  Nanny may know my meaning.
3.  His vicious father has seizures.
4.  Which tea-party did Baker go to?
5.  The little blankets lay around on the floor.
6.  The trouble with swimming is that you can drown.

The VFR-1 scheme performed surprisingly well on all except
the nasal sentence (No. 2), where it achieved a rating no higher
than the F33 system. The poor performance on this sentence may
be due to inadequate transmission of gain: VFR-1 used a lower
average frame rate for gain than did either PM system (see Table
7.2, above). Furthermore, the reduction in gain frame rate was
pronounced in the nasal sentence (No. 2): 15 fps for VFR-1,
compared with 30 fps for PML and PMH. On the other hand, VFR-1
also showed a much lower gain frame rate in Sentence 1, whose
quality was not adversely affected.


The F100 system performed surprisingly badly on Sentences 1
and 2, both of which have continuous voicing and no very large or
rapid changes of spectrum. Earlier work showed that these two
sentences were particularly sensitive to distortions introduced
by too coarse quantization. The "wobbly" quality of these two
sentences, as processed by the F100 system, may be due to
instability as the quantization levels are slowly swept, in the
absence of other spectral discontinuities in the speech material.
The results of such instability would be more noticeable at 100
fps than at 50 fps, both because the instability would have more
opportunity to occur, and also because the periodicity of the
resulting distortion would be nearer to that of the voice
fundamental.


Conclusions 

1.  The  Perceptual  Model  scheme  yielded  the  same  or  even  better 
quality  than  the  fixed  rate  scheme  on  which  it  was  based, 
and  at  substantially  lower  bit  rates. 

2.  The  PMH  system  appears  to  have  achieved  a point  of 
diminishing  returns:  reducing  the  coefficient  frame  rate  from 
31 fps to 27 fps (in PML) yielded significantly worse quality
on  three  of  the  test  sentences,  with  insignificant  savings  in 
bit  rate. 

3.  Since the PMH system equalled or surpassed the F100 system on
which it was based, further improvements in quality can be
obtained only by improving the design decisions that went
into the PMH system. Several subsequent developments have
suggested possible improvements.

4.  The phoneme-specific sentence material yields results that
have high diagnostic value: for example, the poor performance
of VFR-1 on nasals might never have been verified if
homogeneous testing materials had been used.

7.4  Miscellaneous  Topics 

7.4.1  Phoneme-Specific  Intelligibility  Test 

We tried out a phoneme-specific intelligibility test
slightly modified from one that was developed by Stevens [20,21].
The test has two parts, one for consonants and one for vowels.
It is a nonsense-syllable test, using closed response sets of 4-8
items. Both of these factors increase the difficulty of the test
over that of the DRT [22], which is the only other test available
with similar diagnostic power. Weaknesses of the DRT are that it
tests only single consonants in initial position, and the
response set for each item contains only two English
monosyllables, whose initial consonants are a minimal pair,
differing in only one distinctive feature. The small response
set greatly reduces the efficiency of the test, since chance
performance is 50%. In contrast, the Phoneme-Specific
Intelligibility test covers vowels, and both single consonants
and consonant clusters in pre-stress and in final position. The
stimulus items are nonsense syllables of the form /ə'C1VC2/,
where /ə/ is an unstressed schwa like the first syllable of
"about," C1 and C2 are consonants, and V is a stressed vowel.

The complete test consists of 14 separate subtests. The
first ten are consonant tests, each of which uses a single closed
set of consonants from which C1 and C2 are drawn. There are four
versions of each consonant subtest, two of which use one pair of
vowels as syllable nuclei, and two using a second pair of vowels.

A typical consonant test list is shown in Fig. 7.7. Each
consonant in the closed response set appears four times in each
list, once preceding and once following each of the context
vowels. In addition, there are three unscored filler items
(ringed numbers in the figure) added to prevent subjects from
using the symmetry of the test to aid their responding. The
vowel tests are similar, except that each vowel appears four
times in each list, in symmetrical consonant context, and there
are three different sets of consonant contexts for each vowel
subtest. The complete set of 64 lists is given in Appendix 11.
The test is in most respects identical with that reported by
K. N. Stevens [20,21]. The complete test has never been published
before, and we thank Prof. Stevens for permission to include it
here.

One male and one female talker each recorded half of the 64
test lists. We ran preliminary tests on a small subset of the
lists, using four simulated vocoders from those specified for the
test of our quality assessment method, described above in Section
7.2.3. Although the test results were quite encouraging, we
abandoned further testing, since processing the test lists
through simulated, as opposed to real-time, vocoders was
prohibitively time consuming. More details of the pilot tests
can be found in [23].

[Figure: test answer form with fields for test number, name, and
date, the consonant and vowel sets for the list, and numbered
response items.]

Fig. 7.7  A representative consonant test list from the
Phoneme Specific Intelligibility Test (Stevens, 1962).
The whole test is given in Appendix 11.

The  test  is  probably  the  best  available  for  generating 
high-quality  diagnostic  data  about  real-time  systems,  but  even 
here  it  has  two  drawbacks.  The  test  is  long,  taking  several 
hours  for  each  subject,  for  each  tested  system.  Secondly,  some 
of  the  lists  require  the  listeners  to  be  familiar  with  phonetic 
symbols,  which  means  that  additional  training  is  necessary  if 
skilled  subjects  are  not  available.  A further  problem  is  that 
the  diagnostic  data  consist  of  the  pattern  of  errors  made,  and  if 
the  systems  under  test  are  highly  intelligible  it  may  be 
necessary  to  run  large  numbers  of  subjects,  or  repeat  lists,  to 
accumulate  sufficient  errors.  Of  course,  other  diagnostic  tests 
suffer  the  same  disadvantage,  especially  the  DRT  which  forces  a 
choice  between  only  2 alternatives  for  each  test  item,  resulting 
in  a high  chance  performance  level.  An  alternative  method  for 
increasing  the  number  of  errors  is  to  degrade  the  acoustic  (or 
other) environment of the speaker or listeners. This procedure
is  appropriate  only  if  the  added  degradation  remains  within  the 
range  to  be  expected  in  the  final  application. 

The high face-validity of the test procedures, together with
their potential for diagnosing problems with specific types of

7.4.2  Effects of Lost Packets on Intelligibility

Decisions on how much speech to encode in one packet for
transmission over the ARPANET (and for Packet Radio) have been
made on the basis of two factors: overhead and delay. Each
packet contains a fixed number of header bits, etc., and the cost
of this overhead decreases as more speech is encoded in a packet.
On the other hand, packetizing speech introduces a delay equal to
the duration of a packet's contents (in addition to other delays
due to path length and network response). Delays have serious
disrupting effects on conversations [24], so delays must be
minimized.
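The trade-off between overhead and delay can be made concrete with a small sketch. The frame duration, bits per frame, and header size used here are hypothetical values chosen for illustration; they are not the ARPANET figures.

```python
def packet_tradeoff(frames_per_packet, frame_ms, bits_per_frame, header_bits):
    """Return (overhead fraction, packetization delay in ms).

    Overhead is the share of transmitted bits spent on the header;
    delay is the duration of the speech carried in one packet.
    """
    payload_bits = frames_per_packet * bits_per_frame
    overhead = header_bits / (header_bits + payload_bits)
    delay_ms = frames_per_packet * frame_ms
    return overhead, delay_ms


# Hypothetical parameters: 20 ms frames, 60 bits/frame, 200 header bits.
for n in (1, 5, 10):
    overhead, delay = packet_tradeoff(n, 20.0, 60, 200)
    print(f"{n:2d} frames/packet: overhead {overhead:.0%}, delay {delay:.0f} ms")
```

Under these assumed numbers, packing more frames into a packet lowers the overhead fraction but raises the delay in direct proportion.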

NSC Note No. 78 [25] was written to point out that there is
a further factor that should be considered in deciding how much
speech to encode in a packet: the effect on intelligibility of
lost or delayed packets. Work with interrupted speech, and with
speech alternated between the ears, and with "temporally
segmented" speech (summarized in [26]) shows that silent
intervals inserted into continuous speech, whether the silence
displaces or delays the speech waveform, have a maximally
disruptive effect on intelligibility when the silent intervals
are in the range 100 to 300 ms. This is exactly the range of
silent intervals that would be introduced into speech if
reconstruction of the speech had to continue in the absence of a
packet, either lost or delayed. An alternative to leaving a
silent interval is to repeat the preceding packet, but this may
introduce intelligibility problems of its own.

One suggested solution was to interleave the successive
frames of speech data in two independent packets, one containing
the even-numbered frames and the other the odd-numbered frames.
A lost packet would then result in a brief burst of interrupted
speech, with silent intervals of 20 ms, which would have no
effect on intelligibility. The cost would be increased delay.
More details can be found in [25].
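The interleaving idea can be sketched as follows. This is an illustration only, not the scheme specified in [25]; the function names are hypothetical, and a missing frame is represented by None.

```python
def interleave(frames):
    """Split a frame sequence into an even-frame and an odd-frame packet."""
    return frames[0::2], frames[1::2]


def reconstruct(even_packet, odd_packet, total_frames):
    """Rebuild the frame sequence; a lost packet is passed as None."""
    out = [None] * total_frames
    if even_packet is not None:
        out[0::2] = even_packet
    if odd_packet is not None:
        out[1::2] = odd_packet
    return out


frames = list(range(6))             # stand-ins for six consecutive frames
even, odd = interleave(frames)
received = reconstruct(even, None, len(frames))   # odd packet lost
```

With the odd packet lost, every gap in `received` is one frame long, so the silent intervals last one frame duration rather than the duration of a whole packet.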

7.4.3  Descriptor  Inventory  for  Subjective  Quality 

A listening  test  was  conducted  to  identify  terms  descriptive 
of  vocoded  speech  for  listeners  unfamiliar  with  vocoding 
techniques.  The  test  was  carried  out  in  two  stages.  In  the 
first  stage,  the  listeners  were  requested  to  list  adjectives  or 
phrases  that  they  considered  descriptive  of  the  speech  to  which 
they  were  listening.  In  the  second  stage,  they  were  provided 
with  lists  of  words  and  phrases,  and  asked  to  judge  the 
appropriateness of each of the items on the lists to the speech.

The speech samples were those generated for the experiment
described in Section 7.3.2, together with a 110 kbps PCM version
of each of the four sentences, to act as undegraded anchor.
Sentences were heard in pairs. The first member of a pair was
always the unprocessed PCM version of the sentence; the second
member was one of the eight processed versions of the same
sentence spoken by the same talker. Listeners were encouraged to
attend to the ways in which the standard (unprocessed) and test
(processed) samples differed.

Listeners  were  17  undergraduates  who  reported  normal 
hearing, and had no previous experience with vocoded speech.
First, subjects listened to several items and then began making
their  list  of  descriptors.  After  10  minutes,  these  lists  were 
gathered,  and  previously  prepared  check  lists  were  distributed. 
The  listeners  rated  each  of  the  words  and  phrases  on  these  lists, 
on  a 10-point  scale,  for  its  appropriateness  as  a descriptor  of 
the  processed  speech  they  were  hearing.  Meanwhile,  another  list 
was  composed  consisting  of  items  produced  by  the  listeners  during 
the  first  stage,  and  the  listeners  continued  the  test  by 
assigning  scale  values  to  these  new  terms. 

Table 7.3 shows the 127 descriptors presented for rating
during stage 2 of the test. Table 7.4 shows the ten words that
received the highest ratings, considering all of the listeners,
and also considering two subsets of "best" listeners. "Best" was
defined in terms of the similarity of a listener's ratings to the
group mean ratings.

In some of the pairs, the second sentence has a ____ quality.

  1 blary (?)        36 garbled         71 ringy          106 wavery
  2 boomy            37 grating         72 rough          107 wheezy
  3 bouncy           38 grinding        73 scratchy       108 whirring
  4 brassy           39 gruff           74 sharp          109 whispery
  5 breathy          40 gurgly          75 sharp-edged    110 wobbling
  6 burbly           41 guttural        76 shivery        111 yodelling
  7 buzzy            42 hissy           77 shrill         112 whistling
  8 chirpy           43 hollow          78 silvery        113 tinkling (?)
  9 choppy           44 human           79 slurred        114 thin
 10 chattery         45 hum-like        80 smooth         115 swishing
 11 clean            46 hushed          81 smooth-edged   116 screeching
 12 clicky           47 husky           82 soft           117 rumbling
 13 clipped          48 indistinct      83 spitty         118 rippling
 14 coarse           49 jangling        84 spluttery      119 radio-static
 15 computer-like    50 jerky           85 sputtery       120 quavering
 16 crackly          51 mellow          86 squawky        121 harsh
 17 creaky (?)       52 metally (?)     87 squeaky        122 full
 18 crisp            53 monotone        88 steady         123 fluttering
 19 croaky           54 murmury         89 stifled        124 flat
 20 damped           55 musical         90 strained       125 echoing
 21 dead             56 muted           91 strident       126 clear
 22 deep             57 nasal           92 subdued        127 broken
 23 diffused         58 natural         93 telephonic
 24 disconnected     59 noisy           94 throbbing
 25 distinct         60 oscillating     95 tinny
 26 distorted        61 piercing        96 trill
 27 drone-like       62 hi-pitched      97 twangy
 28 dull             63 pulsating       98 tweeting
 29 eddying          64 pure (?)        99 twittery
 30 electronic       65 raspy          100 unbroken
 31 even             66 reed-like      101 unclean
 32 frizzy           67 regular        102 undulatory
 33 flat             68 resonant       103 uneven
 34 fluctuating      69 reverberant    104 vibrant
 35 fuzzy            70 rich           105 warbly

Table 7.3  Descriptor inventory

 All            Best 9         Best 3

 nasal          nasal          nasal
 muffled        muffled        muffled
 distorted      distorted      fuzzy
 monotone       head cold      distorted
 blanketed      garbled        stuffed up
 fuzzy          dull           muted
 head cold      monotone       blanketed
 dull           blanketed      head cold
 garbled        fuzzy          damped
 muted          slurred        parrot-like

Table 7.4  Descriptors with highest utility for a) all subjects,
b) the 9 most consistent subjects, and c) the 3 most
consistent subjects. (See text for details.)
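The selection of "best" listeners can be sketched as below. The text says only that listeners were ranked by the similarity of their ratings to the group mean; the use of Pearson correlation as the similarity measure, the function names, and the sample ratings are all assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_listeners(ratings, k):
    """Return the k listeners whose ratings best match the group mean.

    ratings maps a listener name to a list of ratings, one per descriptor.
    """
    n_items = len(next(iter(ratings.values())))
    group_mean = [mean(r[i] for r in ratings.values()) for i in range(n_items)]
    scores = {name: pearson(r, group_mean) for name, r in ratings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical ratings of three descriptors by three listeners.
ratings = {"a": [1, 5, 9], "b": [2, 5, 8], "c": [9, 5, 2]}
chosen = best_listeners(ratings, 2)
```

Here listener "c" rates in the opposite direction to the group and is excluded from the best two.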

7.4.4  Reducing  Sequence  Effects  in  Quality  Assessment 

In  tests  of  intelligibility,  there  is  an  objectively  correct 
answer  for  each  test  item,  whereas  in  tests  of  speech  quality, 
the  responses  are  judgments  for  which  there  is  no  correct  answer. 
Consequently,  results  obtained  in  speech  quality  tests  tend  to  be 
highly  subject  to  context  effects.  The  rating  assigned  by  a 
subject  to  a particular  test  item  depends  not  only  on  the  test 
item  itself,  but  on  the  range  of  qualities  associated  with  the 
other  systems  under  test,  and  also  on  which  of  these  other 
systems  were  presented  for  judgment  as  the  preceding  two  or  three 
stimuli. That is, different ratings are given to a single
system, depending on which system was presented on the preceding
trial(s).

The usual method of combatting sequential effects is to
counterbalance the presentation sequence, so that every stimulus
is preceded equally often by each of the other stimuli in the
set, and the biases cancel out. Where large numbers of systems
are being compared, this procedure rapidly becomes impractical
since the required number of stimulus presentations increases
with the square of the number of systems being compared. In the
PARM test, developed by Voiers for DCA [27], the number of
stimulus presentations was kept small by comparing only six
systems at a time, two of which were anchor systems that appeared
in every sextet to provide a baseline for comparing different
sextets. However, as Voiers points out, even these carefully
devised conditions failed to adequately control the sequence
effects.
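The quadratic growth is easy to verify: for every stimulus to be preceded at least once by each of the others, each ordered (predecessor, stimulus) pair must occur, and there are n(n-1) such pairs for n systems. The sketch below simply enumerates them; the system labels are placeholders.

```python
from itertools import permutations


def ordered_pairs(systems):
    """List every (predecessor, stimulus) pair a counterbalanced design must cover."""
    return list(permutations(systems, 2))


for n in (6, 12, 24):
    pairs = ordered_pairs(range(n))
    print(n, "systems ->", len(pairs), "ordered pairs")
```

Doubling the number of systems roughly quadruples the number of pairs that must be presented.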

Sequence effects must depend on memory of the perceived
quality of the stimuli presented earlier. If the memory could be
erased, the sequence effects would disappear. One possible
method is suggested by recent work on auditory short-term memory,
on the so-called suffix effect [28,29]. These results show that,
when a list of items is presented for immediate recall, adding an
extra item to the end of the list (the redundant suffix)
interferes with the auditory memory traces of the last items in
the list, even though the subjects knew what the extra item would
be. That is, presenting a redundant suffix erases, at least
partially, the memory traces of earlier items. Since this is
precisely the effect we would like to achieve to reduce sequence
effects in quality tests, we carried out a study in which we
adapted the suffix effect paradigm for this purpose.

The method adopted was to fill the silent intervals between
successive stimuli with speech babble. The babble consisted of a
carefully controlled mix of six different voices, reading a
variety of passages, which had been developed at BBN as part of a
separate project [30]. To test the method, we repeated the
earlier quality study of VFR vocoders reported above in Section
7.3.2, using seven of the eight original subjects. A new
stimulus tape was prepared of the same stimuli, in the same
presentation order. The babble, at the same level as the signal,
was automatically faded out and in again one second before and
after each stimulus presentation.

Each of the two experiments showed a highly significant
assimilative sequence effect. The hoped-for difference between
the two experiments, ascribable to the intervening babble, was
not significant by t-test (p<0.15), although the difference was
in the desired direction, suggesting the babble may have reduced
the sequence effect slightly. In support of this, all subjects
reported that the task seemed easier with the babble, and that
the babble made it harder to compare a stimulus with its
predecessor.

Comparison  of  the  data  collected  with  and  without  babble 
showed  that  both  experiments  yielded  highly  similar  results, 
except  that  the  speech  appeared  slightly  more  degraded  with 
babble,  perhaps  because  the  babble  consisted  of  a mixture  of 
voices  recorded  under  good  conditions,  and  may  therefore  have 
acted  as  an  undegraded  anchor  against  which  the  eight  vocoder 
systems appeared more degraded than in the absence of the babble.

8.  OBJECTIVE SPEECH QUALITY EVALUATION

Quality  assessment  of  vocoded  speech  is  often  performed  to 
determine  the  user  acceptance  of  a vocoder,  or  to  compare  the 
performance  of  competing  vocoder  types,  or  to  evaluate  the 
different  choices  of  a given  vocoder's  design  parameters. 
Procedures  used  for  speech  quality  measurement  are  either 
subjective  or  objective,  depending  upon  whether  or  not  they  make 
use  of  subjective  judgments  from  human  listeners.  Subjective 
procedures  require  extensive  testing  with  human  listeners,  which 
is  expensive  in  terms  of  both  time  and  money.  On  the  other  hand, 
objective  measures  would  enable  evaluation  to  be  done  by  computer 
as  well  as  ensure  uniformity  in  speech  quality  evaluation.  Also, 
objective  measures  can  be  incorporated  into  the  design  of  better 
quality  vocoders.  Of  course,  the  validity  of  any  objective 
procedure  must  first  be  established  by  comparing  its  results 
against subjective judgments.

Major achievements of our objective speech quality
evaluation work have been: (1) Formulation of a general framework,
and (2) Development of several usable objective quality measures
which produce results highly correlated with subjective
judgments. The results of our work have been presented in three
papers, which are included in this report as Appendices 12-14.
Below, we provide a brief summary of these results.


BBN Report No. 3794


Bolt  Beranek  and  Newman  Inc. 


8.1  A General  Framework 


We formulated a general framework for the objective
evaluation of vocoder speech quality, based on the following
reasonable assumptions (for more details, see Appendix 12):

(1)  Speech synthesized from unquantized LPC parameters (14th
order LPC filter, for a speech bandwidth of 5 kHz),
extracted every 10 ms, is of very good quality, compared to
the original speech.

(2)  Except  for  pitch  and  gain,  the  fidelity  of  the  short-time 
speech  spectrum  is  the  principal  determiner  of  quality. 

(3)  The  spectrum  is  uniquely  defined  by  the  linear  prediction 
filter  parameters. 


The first assumption gives us an anchor point, defined in terms
of the unquantized LPC parameters, against which to compare
quantized realizations of the same utterance. The second and
third assumptions relate the filter parameters to speech quality.
In this framework, then, the problem of objective quality
evaluation is reduced to the following two steps: (1) For each 10
ms frame, compute an objective error as the distance or deviation
between the spectrum corresponding to the unquantized LPC
parameters and the spectrum corresponding to the quantized and
interpolated LPC parameters; and (2) Combine all the frame errors
thus computed within a speech utterance into one number, which
becomes the objective speech quality score. Notice that the
described objective quality measurement procedure can be carried
out when the LPC vocoder is in operation.

8.2  Spectral  Distance  Measures 

To  perform  the  task  of  step  (1)  above,  we  developed  several 
spectral  distance  measures  which  produced  results  consistent  with 
published  subjective  perceptual  results  on  formant  frequency 
difference  limens.  A detailed  description  of  these  measures  is 
given  in  Appendix  13.  Briefly,  given  two  smooth  spectra,  the 
distance  between  them  is  computed  in  three  steps: 

(a)  Normalize the two spectra by making them have either the
same geometric mean (GM normalization) or the same value at
zero frequency (DC normalization);

(b)  Determine the error at each frequency as the magnitude of
the difference in linear spectral amplitudes of the two
spectra; and

(c)  Compute the (weighted) norm of this error function after
weighting the error with the perceived loudness function,
originally developed by S.S. Stevens for a different
purpose.

We chose to study in detail the use of two distance measures,
denoted below as d(GM) and d(DC), which use, respectively, GM and
DC normalization. In addition, we considered two other measures,
d(RMS-LOG) and d(LAR), for comparative purposes; the first of
these two measures computes the spectral distance as the rms
value of the difference in the log spectral amplitudes of the two
spectra, and the second measure is the Euclidean distance between
the two p-vectors of LARs corresponding to the two spectra.
Since LARs are readily available in the problem at hand, using
the latter measure is computationally much less expensive than
using any of the other three measures.
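The four measures can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation described in Appendix 13: spectra are taken to be lists of positive amplitudes sampled at equally spaced frequencies starting at zero frequency, the loudness weighting is replaced by a caller-supplied `weights` list (uniform by default, since the Stevens function is not reproduced here), the norm order `p` is a free parameter, and the log difference is expressed in decibels. All function names are hypothetical.

```python
import math


def gm_normalize(spectrum):
    """Scale a spectrum so that its geometric mean is 1 (GM normalization)."""
    gm = math.exp(sum(math.log(s) for s in spectrum) / len(spectrum))
    return [s / gm for s in spectrum]


def dc_normalize(spectrum):
    """Scale a spectrum so that its zero-frequency value is 1 (DC normalization)."""
    return [s / spectrum[0] for s in spectrum]


def d_linear(spec_a, spec_b, normalize=gm_normalize, weights=None, p=2):
    """Weighted norm of the difference in linear amplitudes, steps (a)-(c).

    weights stands in for the perceived-loudness weighting (assumption).
    """
    a, b = normalize(spec_a), normalize(spec_b)
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * abs(x - y) ** p for w, x, y in zip(weights, a, b))
    return (total / sum(weights)) ** (1.0 / p)


def d_rms_log(spec_a, spec_b):
    """RMS difference of the log spectral amplitudes, in dB (assumed unit)."""
    diffs = [20.0 * math.log10(x / y) for x, y in zip(spec_a, spec_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def d_lar(lar_a, lar_b):
    """Euclidean distance between two vectors of log area ratios."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lar_a, lar_b)))


d_gm = d_linear([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])                  # gain differs only
d_dc = d_linear([1.0, 2.0, 4.0], [1.0, 2.0, 5.0], normalize=dc_normalize)
```

In this sketch, two spectra that differ only by a gain factor have zero d(GM) distance, which is the purpose of the normalization step. The LAR distance needs no spectrum evaluation at all, which reflects the computational advantage noted above.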


The task, in step (2) above, of combining the frame errors
into one number involves first weighting the frame errors with a
suitable time-weighting function to reflect the relative
importance of the individual frames to perceived speech quality,
and then averaging the weighted frame errors. A detailed account
of the results of our work on this task, as well as the results
of correlation tests between our objective quality scores and
subjective judgments, are given in Appendix 14. Below, we give a
brief summary of these results.


8.3  Time  Weighting  of  Frame  Spectral  Errors 


We investigated the two time-weighting methods described
below.

(i)  Filter Gain Weighting: In this method, we make the
reasonable assumption that frame errors in low energy regions of
an utterance have a smaller influence on quality judgments than
those in high energy regions. For example, even large changes in
the spectrum may not be detected by the listener if the total
energy in the spectrum is low. We considered the weighting as a
function of the frame speech signal energy per sample expressed
in decibels. A piecewise linear weighting function was found to
produce good correlation between the resulting objective scores
and the corresponding subjective test results.
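A piecewise linear weighting of this kind might look like the sketch below. The breakpoints `lo_db` and `hi_db`, and the choice of a weight of zero below the lower breakpoint and one above the upper breakpoint, are hypothetical; the values actually used are given in Appendix 14, not here.

```python
import math


def frame_energy_db(samples, floor=1e-12):
    """Energy per sample of one frame, in decibels."""
    energy = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(energy, floor))  # floor avoids log(0)


def gain_weight(energy_db, lo_db=30.0, hi_db=60.0):
    """Piecewise linear weight: 0 below lo_db, 1 above hi_db, linear between."""
    if energy_db <= lo_db:
        return 0.0
    if energy_db >= hi_db:
        return 1.0
    return (energy_db - lo_db) / (hi_db - lo_db)


w = gain_weight(frame_energy_db([200.0, -150.0, 180.0, -220.0]))
```

Each frame error would be multiplied by its weight before the time averaging step.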

(ii)  Weighting Based on Our Perceptual Model: In the second
type of (implicit) time weighting that we explored, we employed
as  anchor  or  reference  our  perceptual  model  of  speech  instead  of 
the  100  fps  LPC  analysis  data.  That  is,  we  used  the  analysis 
data  only  for  those  frames  for  which  our  new  automatic  VFR  scheme 
(see  Section  4.2)  decided  to  transmit;  for  all  other  frames,  we 
obtained  the  LPC  data  via  linear  interpolation  between  the 
adjacent  transmitted  frames.  In  addition,  we  employed  an 
explicit  time-weighting  in  which  frame  errors  for  the  transmitted 
frames  are  weighted  with  unity,  while  other  frame  errors  are 
weighted  with  a fraction  depending  on  the  duration  of  the 
transmission  interval  to  which  they  belong. 

8.4  Time-Average  of  Weighted  Frame  Errors 

There are a number of different ways of combining the
weighted frame errors into one number. The simplest time-average
is the arithmetic mean or straight average. We also considered a
two-term composite average: the first term is simply the
arithmetic mean over the whole utterance, and the second term is
the arithmetic mean over the top 10% of the frame errors. A
third measure we investigated is the above composite average but
with the second term computed over a variable percentage of large
frame errors; this variable amount was decided by the "skewness"
of the frame error distribution over the whole utterance.
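The first two averages can be sketched as follows. The text does not say how the two terms of the composite average are combined, so the mixing parameter `alpha` is an assumption, as are the function names; the skewness-driven variant is not shown.

```python
import math
from statistics import mean


def top_fraction_mean(errors, fraction=0.10):
    """Arithmetic mean of the largest `fraction` of the frame errors."""
    k = max(1, math.ceil(fraction * len(errors)))
    return mean(sorted(errors, reverse=True)[:k])


def composite_average(errors, fraction=0.10, alpha=0.5):
    """Two-term average: overall mean plus mean of the largest errors.

    alpha sets the relative weight of the second term (assumed value).
    """
    return (1.0 - alpha) * mean(errors) + alpha * top_fraction_mean(errors, fraction)


frame_errors = [float(e) for e in range(1, 11)]   # hypothetical weighted errors
straight = mean(frame_errors)
composite = composite_average(frame_errors)
```

The second term makes the score more sensitive to a few badly distorted frames than the straight average is.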

8.5  Correlation  with  Subjective  Judgments 

In  our  initial  studies,  we  compared  our  objective  speech 
quality  scores  against  subjective  test  results  obtained  for  the 
five utterances JB1, AR4, JB5, RS6, and DK6, and for 22 of the 49
vocoders  included  in  our  factorial  subjective  speech  quality 
study  (see  Section  7.3.1).  We  computed  two  types  of  correlation 
between  the  objective  and  subjective  data:  (1)  regular,  or 
Pearson's  product-moment,  correlation  (we  shall  call  this  simply 
correlation);  and  (2)  rank  order,  or  Spearman's  rank, 
correlation.  For  the  second  type,  two  sets  of  ranks  are  first 
assigned  to  vocoders  under  study  using  separately  objective  and 
subjective  data,  and  then  regular  correlation  is  computed  between 
the  two  sets  of  ranks.  Correlation  scores  were  used  as  a means 
of  choosing  the  parameters  of  the  time-weighting  and 
time-averaging  schemes  discussed  above. 
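Both correlations can be computed directly from their definitions, as in the sketch below. Tied values are given the average of their ranks, which is a common convention and an assumption here; the sample scores are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Rank-order correlation: Pearson correlation of the two rank sets."""
    return pearson(ranks(x), ranks(y))


objective = [1.0, 2.0, 3.0, 4.0]      # hypothetical objective scores
subjective = [1.0, 4.0, 9.0, 16.0]    # hypothetical subjective ratings
r, rho = pearson(objective, subjective), spearman(objective, subjective)
```

In this example the rank correlation is perfect because the relationship is monotonic, while the regular correlation is slightly below 1 because it is not linear.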

Results  obtained  using  the  correlation  study  are  briefly 
summarized  below: 
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(i)  Using  the  spectral  distance  measure  d(DC)  generally 
produced  substantially  lower  correlations  than  using  any  of 
the  other  three  measures  investigated.  Therefore,  we 
eliminated  the  measure  d(i>C)  in  all  our  subsequent  studies, 

(ii) Correlation scores obtained for the utterances from male
speakers were generally higher than those for the
utterances from female speakers. Also, analysis of our
subjective speech quality test results showed that
subjective rating scores for the utterances from female
speakers were relatively constant over the range of the
number of poles (or LPC order) considered (9-14 poles); in
contrast, the rating scores for male speakers exhibited a
wide range of variation [13]. This suggested varying the
LPC order of the anchor system as a function of the
speaker's average fundamental frequency (or pitch) over the
whole utterance. This technique was found to slightly
enhance the correlation scores for the utterances AR4 and


(iii) An important achievement of our objective speech quality
evaluation work has been that we obtained relatively high
correlation scores. For the measure d(GN), correlation for
individual utterances varied between 0.8 and 0.96; rank
correlation ranged from 0.8 to 0.9. For the measure
d(RMS), these ranges were found to be: 0.85 - 0.94 for
correlation, and 0.83 - 0.88 for rank correlation. For the
measure d(LAR), we obtained the ranges: 0.79 - [...] for
correlation, and 0.78 - 0.83 for rank correlation.

9. TOWARDS REAL-TIME IMPLEMENTATION

We cooperated with the other sites in the ARPA community in
implementing an LPC vocoder that transmits speech over the ARPA
Network in real time. Below, we first describe our work to
develop a real-time speech facility at BBN, and then briefly
summarize the specifications that we provided for the ARPA LPC-II
speech compression system.

9.1  BBN  Speech  Facility 

Our signal processing system was designed to meet the needs
of both the speech compression project and the then existing
speech understanding project. It consists of two computers,
the SPS-41 and the PDP-11. The SPS-41 has a dual-port memory
interface, and we installed a dual-channel A/D and D/A converter
system. We added an IMP11A interface to our system to provide a
link to the ARPA Network.

In close cooperation with the Information Sciences Institute
(ISI), we worked on an on-line loader system for the SPS-41.
This consists of two parts, the Overlay Executive (EXEC) and the
Automatic Reformatter (ARF). The EXEC is an SPS-41 program which
loads information from the PDP-11 into the SPS-41. ARF reformats
the output of the SPS-41 assembler in a way acceptable to the
EXEC. It also provides a mechanism for attaching meaningful
labels to SPS-41 program segments and locations.

We modified the LPC programs and support software supplied
by other ARPA-sponsored sites and by SPS, Inc., to run on our
configuration of the PDP-11/SPS-41 system, and we developed a
procedure for loading these programs from TENEX into the PDP-11.
We worked towards locating and describing hardware problems in
the SPS-41, which appeared to be the cause of system failures
after short periods of successful operation. As part of this
effort, our SPS-41 was moved back to SPS, where we had one person
working full time trying to resolve these problems with the help
of people from SPS. During that time, several hardware problems
were detected and corrected. Subsequently, several versions of
the back-to-back LPC software were successfully run for a
considerable length of time.

We purchased an RT-11 operating system for our PDP-11/40.
Upon delivery of this system, it was modified to permit the use
of the existing Telefile/Century Data disc. This disc has a
storage capacity of 500 Mbits and has been used for temporary
storage of computer programs and sampled speech signals.

Our more recent work has proceeded in two directions: (1) to
develop the PDP-11/SPS-41 system for use as a research tool,
specifically for the acquisition, storage and playback of speech
waveforms, and (2) to bring up a real-time vocoder system on the
ARPANET.

The real-time acquisition and playback system operates in
conjunction with another larger computer system, in this case the
DEC System 20. In typical operation, the real-time system
digitizes and stores an utterance. The user then has the
opportunity of listening to the digitized utterance, displaying
it, editing out such undesirable features as tape recorder pops,
and in general, checking to see that the complete utterance has
been digitized. Initial and final periods of silence are edited
out in order to save storage space. Once the utterance has been
edited and checked, the digitized waveform can be transmitted to
the System 20, to be used in synthesis experiments involving
different vocoder systems. Any synthetic utterances resulting
from these experiments can be transmitted back to the real-time
system, for the user to play out through the D/A converter. We
have also implemented an interactive playback program on the
PDP-11, which allows the user to easily specify and play out any
sequence of digitized speech signals. This program has been
quite useful for running informal listening tests, and for
conveniently and rapidly preparing audio tapes for demo purposes
and for formal subjective speech quality tests.

We have developed support software for the system, including
an FTP (File Transfer Protocol) program which allows us to
transfer files between the real-time system and the System 20 or
any other host on the ARPANET. We have handlers for the IMLAC
PDS-1 display computer, which runs as a peripheral to the PDP-11.
These handlers allow the IMLAC to be used as a high speed
terminal on the PDP-11 and at the same time support its display
functions.

We have also worked closely with ISI to modify EPOS
(Environment for Processing of On-line Speech) to work with our
file structures. EPOS is required by the existing versions of
the LPC vocoder.

9.2  Specifications  for  ARPA  LPC-II  System 

We provided specifications, in the form of NSC Note No. 82
[7], for the ARPA LPC-II speech compression system, an update of
the earlier system LPC-I, for real-time implementation at various
ARPA-sponsored sites. We had previously developed the following
approaches for reducing the redundancy in the speech signal [1]:

(1)  optimal  parameter  quantization  using  LARs, 

(2) variable frame rate (VFR) transmission of LARs,

(3)  variable  order  linear  prediction,  and 

(4)  Huffman  coding. 

We recommended only items (1) and (2) for LPC-II, in an attempt
to reap maximum benefit for the least amount of effort in terms
of changes to LPC-I. Our overall design objective in arriving at
specifications for LPC-II was to achieve average
continuous-speech transmission rates of about 2200 bps. This bit
rate should be contrasted with that of LPC-I, which is about 3500
bps.

There are thus two major differences between LPC-I and
LPC-II: (1) LPC-II uses VFR transmission of LPC
parameters, whereas LPC-I uses a fixed frame rate, and (2) LPC-II
uses new coding/decoding tables for the transmission parameters.
These new tables were obtained using:

(a) uniform quantization of LARs;

(b) different step sizes for different LARs, based on their
relative spectral sensitivities (see Section 3.1); and

(c) smaller ranges (i.e., minimum and maximum values) for
reflection coefficients (or equivalently LARs) than were
used in LPC-I. These ranges were obtained from real speech
data.
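
Items (a) through (c) can be sketched as follows. This is only an illustration: the sign convention of the log area ratio, the range, the step size, and the mid-cell reconstruction are assumptions made here, not the values or tables specified for LPC-II.

```python
import math

def lar(k):
    """Log area ratio of a reflection coefficient k, |k| < 1.
    One common sign convention is assumed; some authors use the reciprocal."""
    return math.log((1.0 + k) / (1.0 - k))

def inverse_lar(g):
    """Reflection coefficient recovered from a log area ratio."""
    return math.tanh(g / 2.0)

def quantize_uniform(g, g_min, g_max, step):
    """Clamp g to [g_min, g_max]; return (level index, reconstructed value).
    Reconstruction at the middle of the cell is an assumption of this sketch."""
    levels = int(round((g_max - g_min) / step))      # number of quantizer cells
    clamped = min(max(g, g_min), g_max)
    index = min(int((clamped - g_min) / step), levels - 1)
    return index, g_min + (index + 0.5) * step

# Hypothetical table entry for one LAR: range [-4, 4], step 0.5 (16 levels).
index, value = quantize_uniform(lar(0.5), -4.0, 4.0, 0.5)
k_hat = inverse_lar(value)   # reflection coefficient seen by the synthesizer
```

A per-coefficient table would hold one such (range, step) pair for each LAR, with smaller steps for the spectrally more sensitive coefficients. Because the decoded value is mapped back through `tanh`, the decoded reflection coefficient always has magnitude below 1.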

Compared to LPC-I, VFR transmission yields a lower (average)
frame rate, while the new coding/decoding tables employ fewer bits
per transmitted frame. Thus, both modifications contribute to
lowering the average bit rate. These modifications were based on
the results of our previous research [1].

Initially, we specified a procedure in which only the log
area ratios were to be transmitted at a variable frame rate; pitch
and gain were to be transmitted essentially at a fixed rate.
Later, in NSC Note 96 [8], we presented VFR transmission schemes
for pitch and gain also. Use of these schemes in LPC-II would
lower the average transmission rate to about 2000 bps for
continuous speech. (With the use of a silence detection
algorithm, these average rates may drop to about 1000 bps or
less.)

LPC-II has been implemented at CHI and ISI. Upon informally
listening to speech from the vocoders LPC-I and LPC-II, provided
to us by CHI, we found, as did CHI, that the speech quality of
LPC-II was about the same as that of LPC-I. The listening tests
also showed that there was room for improvement in speech quality
of LPC-II by using a more perceptually based VFR transmission
scheme for log area ratios than the likelihood ratio method
employed in LPC-II (see Section 4.2). As part of the follow-on
ARPA contract, we plan to implement a real-time LPC system that
employs our perceptual-model-based VFR transmission scheme.

10. MISCELLANEOUS TOPICS

Two additional issues that we investigated during this
project are reported in this section.

10.1 Coding of LPC Parameters Using DPCM

Differential Pulse Code Modulation or DPCM is a well-known
method for quantizing signals which exhibit high correlation
between successive samples. This method has been widely used for
coding speech signals. Following recent work, we used the DPCM
method for coding the LARs, pitch and gain. Each of these
transmission parameters was considered as a discrete-time signal
with time instants given by the frame number. DPCM was applied
to each of these signals independently of the others.

We  applied  the  DPCM  method  for  coding  the  14  transmission 
parameters  (12  LARs,  pitch  and  gain)  extracted  at  a fixed  rate  of 
50  frames/sec  from  10  kHz  sampled  speech  [6,14].  The  resulting 
transmission  bit  rate  was  about  2000  bps.  The  DPCM  coder  of  each 
parameter  required  the  knowledge  of  its  averaged  standard 
deviation  in  order  to  compute  the  quantization  step  size  employed 
by  the  coder.  We  observed  improved  speech  quality  either  when 
this  averaged  standard  deviation  was  updated  by  computing  it  over 
a current  speech  segment  of  about  2-3  seconds,  or  when  an  ADPCM 
coder  (which  adaptively  changes  the  step  size)  was  used. 
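
The principle of coding one parameter track with DPCM can be sketched as below. This is a generic first-order DPCM coder written for illustration, not the coder used in the experiment: the predictor (the previous reconstructed value), the fixed step size, and the number of levels are all assumptions, whereas the report derives the step size from the averaged standard deviation of each parameter.

```python
def dpcm_encode(samples, step, levels=16):
    """Return quantizer codes for a first-order DPCM coder.

    The predictor is the previous reconstructed value, so the encoder
    tracks exactly what the decoder will rebuild."""
    half = levels // 2
    codes, prediction = [], 0.0
    for x in samples:
        code = int(round((x - prediction) / step))
        code = max(-half, min(half - 1, code))   # limited number of levels
        codes.append(code)
        prediction += code * step                # decoder-side reconstruction
    return codes

def dpcm_decode(codes, step):
    """Rebuild the parameter track by accumulating the decoded differences."""
    out, value = [], 0.0
    for code in codes:
        value += code * step
        out.append(value)
    return out

# Hypothetical track of one parameter, one value per frame.
track = [0.0, 0.2, 0.5, 0.9, 1.0, 0.8]
codes = dpcm_encode(track, step=0.1)
rebuilt = dpcm_decode(codes, step=0.1)
```

As long as no frame-to-frame difference exceeds the quantizer range, the reconstruction error stays within half a step. An adaptive (ADPCM) variant would change `step` from frame to frame instead of keeping it fixed.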


With the use of our VFR transmission scheme, the correlation
between adjacent transmitted frame data is greatly reduced, which
means that the DPCM coder when used with the VFR scheme will
yield little or no savings. Also, the above-mentioned 2000 bps
DPCM-coded speech was found to have a slightly inferior overall
quality compared to the speech at 1500 bps from an earlier
version of our VFR system [1]. On the other hand, the DPCM coder
has two main advantages: (1) it produces a nearly fixed-rate bit
stream, and (2) the hardware implementation of the DPCM coder and
decoder is relatively simple and inexpensive.

10.2 Linear Predictive Formant Vocoder

It has been known for some time that formant vocoders enable
speech transmission at very low bit rates (about 500 bps). One
requires of these systems an acceptable level of speech
intelligibility but not necessarily retention of naturalness of
speaker characteristics. Such low-bit-rate systems are of
interest in some applications. Speech transmission through an
underwater channel is a good example.

We conducted a preliminary experiment simulating a formant
vocoder within our LPC system format. Formants were generated
from LPC analysis data. The formant synthesizer was implemented
not using resonators as in conventional formant vocoders, but
employing the canonical or direct form realization of the linear
prediction all-pole filter. The predictor coefficients of the
all-pole filter were computed from the received formant data. It
is this difference in synthesizer implementation which enabled
our formant vocoder to overcome some of the problems encountered
by its predecessors. The LPC formant vocoder can accommodate a
variable number of formants in adjacent frames without causing
any undesirable transients. Incorrect identification of
formants, which in practice can occasionally happen due to
imperfect formant tracking, produces less degradation in the
quality of synthesized speech for the LPC formant vocoder than
for its conventional counterparts. A third advantage stems from
the result we reported in [1] that the parameters of the LPC
synthesizer filter can be updated time-synchronously without
introducing any transients. It is well-known that such
transients occur if one updates the parameters of the resonators
time-synchronously.
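
One plausible way of computing predictor coefficients from formant data is sketched below. It assumes that each formant is described by a center frequency and a bandwidth and contributes one complex-conjugate pole pair; the report does not say whether bandwidths were part of the formant data, so this mapping and the numerical values are assumptions made for illustration.

```python
import math

def formants_to_predictor(formants, fs):
    """Map formants to direct-form coefficients [1, a_1, ..., a_p].

    formants: list of (frequency_hz, bandwidth_hz) pairs; fs: sampling
    rate in Hz. Each formant gives one second-order section
    1 - 2 r cos(theta) z^-1 + r^2 z^-2, so p = 2 * len(formants)."""
    a = [1.0]
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / fs)         # pole radius from bandwidth
        theta = 2.0 * math.pi * freq / fs        # pole angle from frequency
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        product = [0.0] * (len(a) + 2)           # polynomial multiplication
        for i, ai in enumerate(a):
            for j, sj in enumerate(section):
                product[i + j] += ai * sj
        a = product
    return a

# Hypothetical frames at a 10 kHz sampling rate: three formants, then two.
a_three = formants_to_predictor([(500, 60), (1500, 90), (2500, 120)], 10000)
a_two = formants_to_predictor([(700, 80), (1800, 100)], 10000)
```

The filter order simply follows the number of formants received (6 and 4 above), which is how a direct-form synthesizer can accept a different number of formants from frame to frame.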

In the preliminary experiment, we employed the formant data
already computed in our Speech Understanding Project. There, a
14-pole LPC analysis was done every 10 ms on speech sampled at 10
kHz and preemphasized using a 50 Hz first-order filter. The
formant tracker used in that project then extracted, every 10 ms,
up to a maximum of 3 formants in the frequency range 0-3100 Hz.
For unvoiced sounds, often only two formants were determined.
Gain and pitch were also computed every 10 ms. For the purposes
of the preliminary experiment, we did not quantize any of these
analysis parameters. The receiver thus had a variable order
synthesizer. The synthesized speech was found to be quite
intelligible except for the following type of problem: [s] was
often perceived as [sh]. The reason for this problem is that [s],
unlike [sh], has significant energy concentration above 3.1 kHz
and that we essentially low-pass filtered the speech at 3.1 kHz by
considering only those formants below this frequency.

Encouraged by the results of the above work, we conducted a
more detailed study of very low bit-rate speech compression
systems, with support from ARPA-STO. The results of this work
have been reported in [15].

Stable and Efficient Lattice Methods
for Linear Prediction

JOHN MAKHOUL, MEMBER, IEEE


Abstract: A class of stable and efficient recursive lattice methods for
linear prediction is presented. These methods guarantee the stability
of the all-pole filter, with or without windowing of the signal, with
finite wordlength computations, and at a computational cost comparable
to the traditional autocorrelation and covariance methods. In addition,
for data-compression purposes, quantization of the reflection
coefficients can be accomplished within the recursion, if desired.


I.  Introduction 

The autocorrelation method of linear prediction [1]
guarantees the stability of the all-pole filter, but has the
disadvantage that windowing of the signal causes a reduction in
spectral resolution. In practice, even the stability is not always
guaranteed with finite wordlength (FWL) computations [2].
On the other hand, the covariance method [1], [3] does not
guarantee the stability of the filter, even with floating-point
computation, but has the advantage that there is no
windowing of the signal. One solution to these problems was given by
Itakura [4] in his lattice formulation. In this method, filter
stability is guaranteed with no windowing and with much
smaller sensitivity to FWL computations. Unfortunately, this
is accomplished with about a fourfold increase in computation
over the other two methods. A similar method was
independently proposed by Burg [5], [6].

This paper presents a class of lattice methods that guarantees
the stability of the all-pole filter, independently of the
stationarity properties and the duration of the signal. It is shown
that the methods of Itakura and Burg are special cases of this
class of methods. Furthermore, a procedure is given that
reduces the number of computations to values comparable to
those in the autocorrelation and covariance methods. In this
procedure, the "forward" and "backward" residuals are not
computed; the reflection coefficients are computed directly
from the covariance of the input signal.

Section II presents the class of lattice methods for
computing the reflection coefficients, along with conditions for
ensuring stability. Section III describes a procedure, termed the
covariance-lattice method, for performing the necessary
computations efficiently. Computational issues are then discussed in
Section IV, followed in Section V by a step-by-step procedure
for one of the promising lattice methods for linear predictive
analysis.

Manuscript received May 5, 1976; revised September 9, 1976, and
May 13, 1977. This work was supported by the Information Processing
Techniques Branch of the Advanced Research Projects Agency under
Contracts MDA903-75-C-0180 and N00014-75-C-0533.

The author is with Bolt Beranek and Newman Inc., Cambridge, MA
02138.


II.  Lattice  Formulations 

In linear prediction, the signal spectrum is modeled by an
all-pole spectrum with a transfer function given by

    $H(z) = \dfrac{G}{A(z)}$    (1)

where

    $A(z) = \sum_{k=0}^{p} a_k z^{-k}, \quad a_0 = 1$    (2)

is known as the inverse filter, $G$ is a gain factor, $a_k$ are the
predictor coefficients, and $p$ is the number of poles or predictor
coefficients in the model. If $H(z)$ is stable (minimum phase),
$A(z)$ can be implemented as a lattice filter [4], as shown in
Fig. 1. The reflection (or partial correlation) coefficients $K_m$
in the lattice are uniquely related to the predictor coefficients.
Given $K_m$, $1 \le m \le p$, the set $\{a_k\}$ is computed by the
recursive relation

    $a_m^{(m)} = K_m$
    $a_j^{(m)} = a_j^{(m-1)} + K_m a_{m-j}^{(m-1)}, \quad 1 \le j \le m-1$    (3)

where the equations in (3) are computed recursively for $m = 1,
2, \ldots, p$. After each recursion, the coefficients
$\{a_j^{(m)}\}$, $1 \le j \le m$, are the desired coefficients for
the mth-order predictor. The final solution is given by
$a_j = a_j^{(p)}$, $1 \le j \le p$. For a stable $H(z)$, one must have

    $|K_m| < 1, \quad 1 \le m \le p$.    (4)
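
The recursion in (3) can be sketched in a few lines. The function name and the list layout (index 0 holds $a_0 = 1$) are choices made here for illustration.

```python
def step_up(reflection):
    """Convert reflection coefficients K_1..K_p into predictor
    coefficients [a_0 = 1, a_1, ..., a_p] using recursion (3)."""
    a = [1.0]
    for m, k in enumerate(reflection, start=1):
        prev = a                                  # order m-1 coefficients
        a = ([1.0]
             + [prev[j] + k * prev[m - j] for j in range(1, m)]
             + [k])                               # a_m^(m) = K_m
    return a

# Two-stage example: K_1 = 0.5, K_2 = -0.2.
coeffs = step_up([0.5, -0.2])
```

For this two-stage example the result is $a_1 = K_1(1 + K_2) = 0.4$ and $a_2 = K_2 = -0.2$, which can be checked by hand from (3).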

In the lattice formulation, the reflection coefficients can be
computed by minimizing some norm of the forward residual
$f_m(n)$ or the backward residual $b_m(n)$, or a combination of the
two. From Fig. 1, the following relations hold:

    $f_0(n) = b_0(n) = s(n)$    (5a)
    $f_{m+1}(n) = f_m(n) + K_{m+1} b_m(n-1)$    (5b)
    $b_{m+1}(n) = K_{m+1} f_m(n) + b_m(n-1)$    (5c)

where $s(n)$ is the input signal and $e(n) = f_p(n)$ is the output
residual. In z-transform notation: $E(z) = A(z)S(z)$.
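
Relations (5a)-(5c) translate directly into a sample-by-sample filter. The sketch below assumes that the signal is zero before the first sample; the function and variable names are chosen here for illustration.

```python
def lattice_residual(signal, reflection):
    """Pass s(n) through the lattice inverse filter of (5) and return
    the output residual e(n) = f_p(n). Samples before n = 0 are taken
    to be zero (an assumption of this sketch)."""
    p = len(reflection)
    delayed_b = [0.0] * p             # delayed_b[m] holds b_m(n-1)
    out = []
    for s in signal:
        f = s                         # f_0(n), (5a)
        b = s                         # b_0(n), (5a)
        for m in range(p):
            k = reflection[m]         # K_{m+1}
            f_next = f + k * delayed_b[m]     # (5b)
            b_next = k * f + delayed_b[m]     # (5c)
            delayed_b[m] = b          # keep b_m(n) for the next sample
            f, b = f_next, b_next
        out.append(f)
    return out

# The impulse response of the lattice is the set of predictor coefficients.
impulse_response = lattice_residual([1.0, 0.0, 0.0, 0.0], [0.5, -0.2])
```

Feeding a unit impulse through the lattice returns the coefficients of $A(z)$; for $K_1 = 0.5$, $K_2 = -0.2$ this gives 1, 0.4, -0.2, 0, in agreement with recursion (3).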

We shall give several methods for the determination of the
reflection coefficients. These methods depend on different
ways of correlating the forward and backward residuals. Below,
we shall make use of the following definitions:

    $F_m(n) = E[f_m^2(n)]$    (6a)
    $B_m(n) = E[b_m^2(n)]$    (6b)
    $C_m(n) = E[f_m(n) b_m(n-1)]$    (6c)

where $E(\cdot)$ denotes the expected value. The left-hand side of
each of the equations in (6) is a function of $n$ because
we are making the general assumption that the signals are
nonstationary. (Subscripts, etc., will be dropped sometimes for
convenience.)

Fig. 1. Lattice inverse filter A(z).

A.  Forward  Method 

In this method, the reflection coefficient at stage $m + 1$ is
obtained as a result of the minimization of an error norm given
by the variance (or mean square) of the forward residual

    $F_{m+1}(n) = E[f_{m+1}^2(n)]$.    (7)

By substituting (5b) in (7) and differentiating with respect to
$K_{m+1}$, one obtains

    $K_{m+1}^f = -\dfrac{E[f_m(n) b_m(n-1)]}{E[b_m^2(n-1)]} = -\dfrac{C_m(n)}{B_m(n-1)}$.    (8)

This method of computing the filter parameters is similar to
the autocorrelation and covariance methods in that the
mean-squared forward residual is minimized.

B.  Backward  Method 

In this case, the minimization is performed on the variance
of the backward residual at stage $m + 1$. From (5c) and (6b),
the minimization of $B_{m+1}(n)$ leads to

    $K_{m+1}^b = -\dfrac{E[f_m(n) b_m(n-1)]}{E[f_m^2(n)]} = -\dfrac{C_m(n)}{F_m(n)}$.    (9)

Note that, since $F_m(n)$ and $B_m(n-1)$ are both nonnegative
and the numerators in (8) and (9) are identical, $K^f$ and $K^b$
always have the same sign $S$:

    $S = \mathrm{sign}\, K^f = \mathrm{sign}\, K^b$.    (10)

C.  Geometric-Mean  Method  (Itakura) 

The main problem in the previous two techniques is that the
computed reflection coefficients are not always guaranteed to
be less than 1 in magnitude; i.e., the stability of $H(z)$ is not
guaranteed. One solution to this problem was offered by
Itakura [4] where the reflection coefficients are computed from

    $K_{m+1}^I = -\dfrac{E[f_m(n) b_m(n-1)]}{\sqrt{E[f_m^2(n)]\, E[b_m^2(n-1)]}} = -\dfrac{C_m(n)}{\sqrt{F_m(n) B_m(n-1)}}$.    (11)

$K_{m+1}^I$ is the negative of the statistical correlation between
$f_m(n)$ and $b_m(n-1)$; hence, property (4) follows. To the
author's knowledge, (11) cannot be derived directly by
minimizing some error criterion. However, from (8), (9), and (11),
one can easily show that $K^I$ is the geometric mean of $K^f$ and
$K^b$:

    $K^I = S \sqrt{K^f K^b}$    (12)

where $S$ is given by (10), and we have omitted the subscript
$m + 1$. From the properties of the geometric mean, it follows
that

    $\min[|K^f|, |K^b|] \le |K^I| \le \max[|K^f|, |K^b|]$.

Now, since $|K^I| < 1$, it follows that if the magnitude of either
$K^f$ or $K^b$ is greater than 1, the magnitude of the other is
necessarily less than 1. This important property can be summarized
by the following.

    If $|K^f| > 1$, then $|K^b| < 1$.
    If $|K^b| > 1$, then $|K^f| < 1$.    (13)

Property (13) immediately brings to mind another possible
definition for the reflection coefficient that guarantees stability.

D. Minimum Method

    $K^M = S \min[|K^f|, |K^b|]$.    (14)

This says that at each stage, compute $K^f$ and $K^b$ and choose as
the reflection coefficient the one with the smaller magnitude.
Property (13) guarantees that $K^M$ satisfies (4).

E.  General  Method 

Between $K^M$ and $K^I$ there are an infinity of values that can
be chosen as valid reflection coefficients (i.e., $|K| < 1$). These
can be conveniently defined by taking the generalized rth
mean of $K^f$ and $K^b$:

    $K^r = S \left[ \tfrac{1}{2} \left( |K^f|^r + |K^b|^r \right) \right]^{1/r}$.    (15)

As $r \to 0$, $K^r \to K^I$, the geometric mean. For $r > 0$, $K^r$ cannot
be guaranteed to satisfy (4). Therefore, for $K^r$ to be a
reflection coefficient, we must have $r \le 0$. In particular

    $K^0 = K^I, \quad K^{-\infty} = K^M$.    (16)

If the signal is stationary, one can show that $K^f = K^b$, and that

    $K^r = K^f = K^b$, all $r$ (Stationary Case).    (17)

F.  Harmonic-Mean  Method  (Burg) 

There is one value of $r$ for which $K^r$ has some interesting
properties, and that is $r = -1$. $K^{-1}$, then, would be the
harmonic mean of $K^f$ and $K^b$:

    $K^B = K^{-1} = \dfrac{2 K^f K^b}{K^f + K^b} = -\dfrac{2 C_m(n)}{F_m(n) + B_m(n-1)}$.    (18)

One can show that

    $|K^M| \le |K^B| \le |K^I|$.    (19)

In fact, Itakura used $K^B$ as an approximation to $K^I$ in (11) to
avoid computing the square root.

One important property of $K^B$ that is not shared by $K^I$ and
$K^M$ is that $K^B$ results directly from the minimization of an
error criterion. The error is defined as the sum of the
variances of the forward and backward residuals

    $E_{m+1}(n) = F_{m+1}(n) + B_{m+1}(n)$.    (20)

Using (5) and (6), one can show that the minimization of (20)
indeed leads to (18). One can also show that the forward and
backward minimum errors at stage $m + 1$ are related to those
at stage $m$ by the following:

    $F_{m+1}(n) = [1 - (K_{m+1}^B)^2]\, F_m(n)$    (21a)
    $B_{m+1}(n) = [1 - (K_{m+1}^B)^2]\, B_m(n-1)$.    (21b)

This formulation is originally due to Burg [5], [6].

G.  Discussion 

Note that, in general, lattice methods do not minimize any
global error criterion, such as the variance of the final forward
residual, etc. Any minimization that might take place is done
stage by stage. If the signal s(n) is truly stationary, the
stage-by-stage minimization gives the same result as global
minimization. In fact, for a stationary signal, all the lattice
methods previously described, as well as the autocorrelation and
covariance methods, give the same result. However, in general,
the signal cannot be assumed to be stationary and the different
lattice methods will give different results, which are still
different from the covariance-method result. The lattice methods
will indeed give suboptimal solutions; solutions that tend to an
optimal solution as the signal becomes more stationary. Which
lattice method to choose in a particular situation, then, is not
clear cut. We tend to prefer the use of $K^B$ in (18) because it
minimizes a reasonable and well-defined error criterion.
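
The different definitions in Sections II-A through II-F can be compared numerically with a small sketch. The inputs stand for $F_m(n)$, $B_m(n-1)$ and $C_m(n)$; the numerical values used below are hypothetical, and the function names are chosen here for illustration.

```python
import math

def reflection_candidates(F, B, C):
    """Candidate reflection coefficients at stage m+1.

    F = F_m(n), B = B_m(n-1), C = C_m(n); F > 0, B > 0 and C != 0
    are assumed so that every definition is well defined."""
    kf = -C / B                                  # forward method, (8)
    kb = -C / F                                  # backward method, (9)
    s = math.copysign(1.0, kf)                   # common sign S, (10)
    ki = -C / math.sqrt(F * B)                   # geometric mean, (11)
    km = s * min(abs(kf), abs(kb))               # minimum method, (14)
    kh = -2.0 * C / (F + B)                      # harmonic mean, (18)
    return {"forward": kf, "backward": kb, "geometric": ki,
            "minimum": km, "harmonic": kh}

def generalized_mean(kf, kb, r):
    """Generalized rth mean K^r of (15), for r != 0."""
    s = math.copysign(1.0, kf)
    return s * (0.5 * (abs(kf) ** r + abs(kb) ** r)) ** (1.0 / r)

# Hypothetical nonstationary case in which the forward estimate is unstable.
k = reflection_candidates(F=4.0, B=1.0, C=-1.5)
```

With these values the forward estimate is 1.5, which violates (4), while the backward, minimum, harmonic-mean and geometric-mean estimates are 0.375, 0.375, 0.6 and 0.75, in the order required by (19). Calling `generalized_mean(1.5, 0.375, -1)` reproduces the harmonic-mean value.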


III. The Covariance-Lattice Method

If linear predictive analysis is to be performed on a regular
computer, the number of computations for the lattice methods
given far exceeds that of the autocorrelation and covariance
methods (see the first row of Table I). This is unfortunate
since, otherwise, lattice methods generally have superior
properties when compared to the autocorrelation and covariance
methods (see Table II). Below, we derive a new method,
called the covariance-lattice method, which has all the
advantages of a regular lattice, but with an efficiency comparable to
the two nonlattice methods.

From the recursive relations in (3) and (5), one can show that

    $f_m(n) = \sum_{k=0}^{m} a_k^{(m)} s(n - k)$    (22a)
    $b_m(n) = \sum_{k=0}^{m} a_k^{(m)} s(n - m + k)$.    (22b)

Squaring (22a) and taking the expected value, there results

    $F_m(n) = \sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} \phi(k, i)$    (23)

where

    $\phi(k, i) = E[s(n - k) s(n - i)]$    (24)

is the nonstationary autocorrelation (or covariance) of the
signal $s(n)$. ($\phi(k, i)$ in (24) is technically a function of $n$,
which has been dropped for convenience.) In a similar fashion
one can show from (22b), with $n$ replaced by $n - 1$, that

    $B_m(n - 1) = \sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} \phi(m + 1 - k, m + 1 - i)$    (25)

    $C_m(n) = \sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} \phi(k, m + 1 - i)$.    (26)

Given the covariance of the signal, the reflection coefficient at
stage $m + 1$ can be computed from (23), (25), and (26) by
substituting them in the desired formula for $K_{m+1}$. The name
"covariance-lattice" stems from the fact that this is basically
a lattice method that is computed from the covariance of the
signal; it can be viewed as a way of stabilizing the covariance
method. One salient feature is that the forward and
backward residuals are never actually computed in this method.
But this is not different from the nonlattice methods.

In the harmonic-mean method (18), $F_m(n)$ need not be
computed from (23); one can use (21a) instead, with $m$ replaced
by $m - 1$. However, one must use (25) to compute $B_m(n - 1)$;
(21b) cannot be used because $B_{m-1}(n - 2)$ would be needed
and it is not readily available.
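
A minimal sketch of the covariance-lattice idea with the harmonic-mean definition (18) follows. It evaluates the double sums (23), (25) and (26) directly, without any of the computational savings discussed later, and it estimates the covariance by a plain sum of lagged products over the available samples; that estimator, the test signal, and all names are assumptions made for illustration, not the step-by-step procedure of the paper.

```python
import math

def covariance(signal, p):
    """phi[k][i] = sum over n = p..N-1 of s(n-k) s(n-i), 0 <= k, i <= p."""
    N = len(signal)
    return [[sum(signal[n - k] * signal[n - i] for n in range(p, N))
             for i in range(p + 1)] for k in range(p + 1)]

def covariance_lattice(signal, p):
    """Return (K_1..K_p, [1, a_1, ..., a_p]) without forming residuals."""
    phi = covariance(signal, p)
    a, ks = [1.0], []
    for m in range(p):                    # K_{m+1} from the order-m predictor
        idx = range(m + 1)
        F = sum(a[k] * a[i] * phi[k][i] for k in idx for i in idx)           # (23)
        B = sum(a[k] * a[i] * phi[m + 1 - k][m + 1 - i]
                for k in idx for i in idx)                                   # (25)
        C = sum(a[k] * a[i] * phi[k][m + 1 - i] for k in idx for i in idx)   # (26)
        k_new = -2.0 * C / (F + B)        # harmonic mean, (18)
        ks.append(k_new)
        a = ([1.0]
             + [a[j] + k_new * a[m + 1 - j] for j in range(1, m + 1)]
             + [k_new])                   # recursion (3)
    return ks, a

# Hypothetical test signal: two sinusoids plus a small deterministic dither.
s = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + ((n * 7919) % 13 - 6) / 10.0
     for n in range(200)]
ks, a = covariance_lattice(s, 4)
```

Because F, B and C are sums over the same range of samples, the Cauchy-Schwarz inequality keeps every computed coefficient at or below 1 in magnitude, even though the signal is never windowed.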


A. Stationary Case

For a stationary signal, the covariance reduces to the
autocorrelation

    $\phi(k, i) = R(i - k)$ (Stationary).    (27)

From (23)-(27), it is clear that

    $F_m = B_m = \sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} R(i - k)$    (28)

and

    $C_m = \sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} R(m + 1 - i - k)$.    (29)

Making use of the normal equations [1]

    $\sum_{i=0}^{m} a_i^{(m)} R(i - k) = 0, \quad 1 \le k \le m$    (30)

and of (21), one can show that the stationary reflection
coefficient is given by

    $K_{m+1} = -\dfrac{\sum_{k=0}^{m} a_k^{(m)} R(m + 1 - k)}{(1 - K_m^2)\, F_{m-1}}$    (31)

with $F_0 = R(0)$. Equation (31) is exactly the equation used in
the autocorrelation method.

B. Quantization of Reflection Coefficients

One of the features of lattice methods is that the
quantization of the reflection coefficients can be accomplished within
the recursion, i.e., $K_m$ can be quantized before $K_{m+1}$ is
computed. In this manner, it is hoped that some of the effects of
quantization can be compensated for.

In applying the covariance-lattice procedure to the
harmonic-mean method, one must be careful to use (23) and not (21a)
to compute $F_m(n)$. The reason is that (21a) is based on the
optimality of $K_m^B$, which would no longer be true after
quantization.

Similar reasoning can be applied to the autocorrelation
method. Those who have tried to quantize $K_m$ inside the
recursion have no doubt been met with serious difficulties. The
reason is that (31) assumes the optimality of the predictor
coefficients at stage $m$, which no longer would be true if $K_m$
were quantized. The solution is to use (28) and (29), which
make no assumptions of optimality. Thus we have what we
shall call the autocorrelation-lattice method, where there is
only one definition of $K_{m+1}$:

    $K_{m+1} = -\dfrac{C_m}{F_m} = -\dfrac{\sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} R(m + 1 - i - k)}{\sum_{k=0}^{m} \sum_{i=0}^{m} a_k^{(m)} a_i^{(m)} R(i - k)}$.    (32)

IV. Computational Issues

A. Simplifications

Equations (23), (25), and (26) can be rewritten to reduce
the number of computations by about one half. The results
for $C_m(n)$ and $F_m(n) + B_m(n - 1)$ can be shown to be as follows:

    $C_m(n) = \phi(0, m + 1) + \sum_{k=1}^{m} a_k^{(m)} [\phi(0, m + 1 - k) + \phi(k, m + 1)]$
        $+ \sum_{k=1}^{m} [a_k^{(m)}]^2 \phi(k, m + 1 - k)$
        $+ \sum_{k=1}^{m-1} \sum_{i=k+1}^{m} a_k^{(m)} a_i^{(m)} [\phi(k, m + 1 - i) + \phi(i, m + 1 - k)]$    (33)

    $F_m(n) + B_m(n - 1) = \phi(0, 0) + \phi(m + 1, m + 1)$
        $+ 2 \sum_{k=1}^{m} a_k^{(m)} [\phi(0, k) + \phi(m + 1, m + 1 - k)]$
        $+ \sum_{k=1}^{m} [a_k^{(m)}]^2 [\phi(k, k) + \phi(m + 1 - k, m + 1 - k)]$
        $+ 2 \sum_{k=1}^{m-1} \sum_{i=k+1}^{m} a_k^{(m)} a_i^{(m)} [\phi(k, i) + \phi(m + 1 - k, m + 1 - i)]$.    (34)


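The third term in (33) can be computed more efficiently as
follows:

    $\sum_{k=1}^{m} [a_k^{(m)}]^2 \phi(k, m + 1 - k) = \sum_{k=1}^{\lfloor m/2 \rfloor} \{ [a_k^{(m)}]^2 + [a_{m+1-k}^{(m)}]^2 \}\, \phi(k, m + 1 - k) + [a_{(m+1)/2}^{(m)}]^2\, \phi\!\left(\tfrac{m+1}{2}, \tfrac{m+1}{2}\right)$    (35)

where the last term is included only if $m$ is odd.

A similar simplification can be used in (34).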

For the stationary case, (28) can be rewritten as

$$F_m = \sum_{k=-m}^{m} b_k^{(m)} R(k) = b_0^{(m)} R(0) + 2 \sum_{k=1}^{m} b_k^{(m)} R(k) \tag{36}$$

where

$$b_k^{(m)} = \sum_{i=0}^{m-|k|} a_i^{(m)} a_{i+|k|}^{(m)} \tag{37}$$

is the autocorrelation of the impulse response of A(z). By
setting $l = i + k$ in (29), one can show that $C_m$ is reduced to

$$C_m = \sum_{l=0}^{2m} c_l^{(m)} R(m+1-l) \tag{38}$$

where

$$c_l^{(m)} = \sum_{k=0}^{m} a_k^{(m)} a_{l-k}^{(m)}, \quad 0 \le l \le 2m \tag{39}$$

is the convolution of the impulse response of A(z) with itself.
Equation (39) assumes that $a_k^{(m)} = 0$ for $k < 0$ and $k > m$.
Equation (38) can be rewritten as

$$C_m = R(m+1) + 2 a_1^{(m)} R(m) + c_{m+1}^{(m)} R(0) + \sum_{k=1}^{m-1} \left( c_{m+1-k}^{(m)} + c_{m+1+k}^{(m)} \right) R(k). \tag{40}$$

Equation (39) can also be rewritten to reduce the
computations further.
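The stationary shortcuts (36)-(40) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes that (28) and (29) are the double sums $F_m = \sum_k \sum_i a_k a_i R(i-k)$ and $C_m = \sum_k \sum_i a_k a_i R(m+1-i-k)$, and the function name `stationary_F_C` is hypothetical. It forms $b_k$ as the autocorrelation of the coefficient vector and $c_l$ as its self-convolution.

```python
import numpy as np

def stationary_F_C(a, R):
    """Return (F_m, C_m) for the stationary case.

    a : [1, a_1, ..., a_m], coefficients of A(z) at stage m
    R : [R(0), ..., R(m+1)], autocorrelation lags, with R(-k) = R(k)
    """
    a = np.asarray(a, dtype=float)
    R = np.asarray(R, dtype=float)
    m = len(a) - 1
    b = np.correlate(a, a, mode="full")   # b_k for k = -m..m, as in (37)
    c = np.convolve(a, a)                 # c_l for l = 0..2m, as in (39)
    lags_F = np.abs(np.arange(-m, m + 1))           # |k|
    lags_C = np.abs(m + 1 - np.arange(2 * m + 1))   # |m + 1 - l|
    F = np.dot(b, R[lags_F])              # sum over k of b_k R(k), as in (36)
    C = np.dot(c, R[lags_C])              # sum over l of c_l R(m+1-l), as in (38)
    return F, C

if __name__ == "__main__":
    F, C = stationary_F_C([1.0, -0.9, 0.4], [2.0, 1.2, 0.5, 0.1])
    print("F_m =", F, " C_m =", C, " K_{m+1} =", -C / F)
```

Under these assumptions the reflection coefficient of the autocorrelation-lattice method follows as `-C / F`, and no optimality of the current coefficients is required.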

B.  Covariance Computation

The covariance $\phi(k, i)$ of the signal is defined in (24) as a
nonstationary autocorrelation, which, strictly speaking, should
be estimated by averaging over an ensemble of the random
process. In practice, however, it is often the case that such
averaging is neither feasible nor desirable. For example, in
most speech applications, one is interested in analyzing the
time-varying properties of a particular utterance and not the
whole ensemble of speech that a speaker might utter. In the
case where a single time history of a random process is
available for analysis, it is common to describe that single time
record as nonstationary if its short-term sample properties (such
as mean and autocorrelation) vary significantly with time [8].
For this situation, we give below two methods for computing
the covariance of a signal that is known, say, for $0 \le n \le N-1$.

Method 1:

$$\phi(k, i) = \sum_{n=p}^{N-1} s(n-k)\, s(n-i), \quad 0 \le k, i \le p \tag{41}$$

where p is the order of the predictor, and the customary
division by the number of terms in the summation (in this case
N - p) has been omitted since it does not affect the solution for
the reflection coefficients. If we assume that (41) estimates
the covariance at time $\tau = 0$, then the covariance at any other
time $\tau$ can be estimated by setting the lower and upper limits
of the summation in (41) to $p + \tau$ and $N - 1 + \tau$, respectively.
Note that (41) makes no assumptions about the signal outside
the given range and, hence, is especially useful for short
durations [6] and nonstationary signals. On the other hand, if the
signal is assumed to be zero outside the given range (i.e., the
signal is windowed), then the signal is effectively forced to be
stationary, with an associated autocorrelation given by

$$R(i) = \sum_{n=0}^{N-1-|i|} s(n)\, s(n+|i|), \quad 0 \le i \le p. \tag{42}$$

Method 2: The second method makes maximum use of the
data in the range $0 \le n \le N-1$. This is accomplished by
recomputing the covariance for each new lattice stage as follows:

$$\phi_m(k, i) = \sum_{n=m}^{N-1} s(n-k)\, s(n-i), \quad 0 \le k, i \le m \tag{43}$$

where $\phi_m$ is the covariance used in computing $K_m$. The
computations in (43) can be simplified considerably by noting
that

$$\phi_{m+1}(k, i) = \phi_m(k, i) - s(m-k)\, s(m-i), \quad 0 \le k, i \le m. \tag{44}$$

Therefore, the covariance coefficients for stage m+1 can be
computed from those for stage m using (44) in the range
$0 \le k, i \le m$. For $k = m+1$ or $i = m+1$, (43) needs to be used.

It can be shown that when Method 2 for computing the
covariance is used in conjunction with the harmonic-mean
computation in (18), the results for the reflection coefficients are
identical to Burg's method as described in [6]. However, our
results here are obtained at a much lower computational cost.

For the case where N >> p, Methods 1 and 2 should give
similar results. However, if N is not much greater than p, then
it would seem reasonable to utilize the given data maximally
by using Method 2.
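The two covariance computations can be sketched in a few lines. This is only an illustration under the reading of (41), (43), and (44) used here (summation from n = order to N-1); the helper name `covariance` is hypothetical. Method 1 corresponds to `covariance(s, p)` and the stage-m covariance of Method 2 to `covariance(s, m)`.

```python
import numpy as np

def covariance(s, order):
    """phi[k, i] = sum over n = order..N-1 of s[n-k] * s[n-i], 0 <= k, i <= order."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    # Row k holds the delayed segment s[order-k], ..., s[N-1-k].
    X = np.stack([s[order - k:N - k] for k in range(order + 1)])
    return X @ X.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.standard_normal(20)
    m = 3
    phi_m = covariance(s, m)
    # Relation (44): drop the n = m term to move from stage m to stage m+1.
    v = s[m - np.arange(m + 1)]
    phi_next = phi_m - np.outer(v, v)
    print(np.allclose(phi_next, covariance(s, m + 1)[:m + 1, :m + 1]))
```

The update via (44) costs one outer product per stage, whereas recomputing (43) from the samples costs on the order of N multiplications per element.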

There are other possible methods for computing the
covariance or the autocorrelation of the signal. Irrespective of which
method one chooses, it is important to make sure that the
resulting covariance or autocorrelation function is positive
definite. Otherwise, filter stability cannot be guaranteed.

C.  Computational Cost

Table I shows a comparison of the number of computations
for the different methods, where terms of order p have been
neglected. The computations for the autocorrelation-lattice
and covariance-lattice methods are on the order of $pN + O(p^3)$,
as compared to 5pN for the regular lattice methods where the
residuals are computed. For N >> p, the new lattice methods
typically offer a 3- to 4-fold saving over the regular lattice
methods.

When compared to nonlattice methods, the increase in
computation for the covariance-lattice method is not significant if
N is large compared to p, which is usually the case (compare
the first and second rows in Table I). Furthermore, in the
covariance-lattice method, the number of signal samples can
be reduced to about half that used in the autocorrelation
method. This not only reduces the number of computations
but also improves spectral resolution by reducing the amount
of averaging.

D.  FWL Computations

One point of comparison between the different methods is
the stability of the all-pole filter when FWL computations are
used. The main comparison here is between the autocorrelation
method and the lattice methods (the covariance method
cannot guarantee stability, in general, even with floating-point
computations). Under FWL conditions, we expect filter
stability to be ensured more with the lattice methods than with the
autocorrelation method. If, at some stage of the recursion,
$K_m$ turns out to be greater than one in magnitude because of FWL
computations, it can be artificially set to a value less than one to ensure
stability. Such a scheme would work well with the lattice
methods, but not with the autocorrelation method because in
the latter, global optimality of each K is assumed at every
stage. Lack of optimality leads to error propagation, which in
turn makes later stages more susceptible to instability. The
problem does not exist to the same magnitude in the lattice
methods since consecutive stages are "decoupled," with no
assumptions of global optimality being made. This
phenomenon is the same as that discussed in Section III-B, which
allows the quantization of the reflection coefficient inside the
recursion of the lattice methods.

V.  Procedure 

Below is the complete algorithm for what we believe
currently to be one of the more promising methods for linear
predictive analysis. It comprises the harmonic-mean definition
(18) for the reflection coefficients, and the covariance-lattice
method.

a)  Compute the covariances $\phi(k, i)$ for k, i = 0, 1, ..., p.

b)  m <- 0.

c)  Compute $C_m(n)$ and $F_m(n) + B_m(n-1)$ from (33) and
(34), or from (23), (25), and (26).

d)  Compute $K_{m+1}$ from (18).

e)  Quantize $K_{m+1}$, if desired (perhaps using log area ratios
[7] or some other technique).

f)  Using (3), compute the predictor coefficients $\{a_k^{(m+1)}\}$
from $\{a_k^{(m)}\}$ and $K_{m+1}$. Use the quantized value, if $K_{m+1}$
was quantized in e).

g)  m <- m + 1.

h)  If m < p, go to c); otherwise exit.
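Steps a) through h) can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that (18) is the harmonic-mean (Burg) definition $K_{m+1} = -2C_m/(F_m + B_m)$, that (23), (25), and (26) are the double sums over $\phi(k,i)$, $\phi(m+1-k, m+1-i)$, and $\phi(k, m+1-i)$, and that (3) is the usual step-up recursion $a_k^{(m+1)} = a_k^{(m)} + K_{m+1} a_{m+1-k}^{(m)}$. The names `covariance_lattice` and `quantize` are hypothetical.

```python
import numpy as np

def covariance_lattice(phi, p, quantize=None):
    """Reflection and predictor coefficients from a (p+1) x (p+1) covariance matrix.

    phi      : symmetric matrix phi[k, i], 0 <= k, i <= p (step a)
    quantize : optional callable applied to each K inside the recursion (step e)
    """
    phi = np.asarray(phi, dtype=float)
    a = np.array([1.0])                    # a^(0) = [1]                 (step b)
    K = []
    for m in range(p):
        sub = phi[:m + 2, :m + 2]          # covariances with indices 0..m+1
        fwd = np.append(a, 0.0)            # a_k stored at index k
        bwd = fwd[::-1]                    # a_k stored at index m+1-k
        F = fwd @ sub @ fwd                # forward residual energy     (step c)
        B = bwd @ sub @ bwd                # backward residual energy
        C = fwd @ sub @ bwd                # cross term
        k = -2.0 * C / (F + B)             # harmonic-mean definition    (step d)
        if quantize is not None:
            k = quantize(k)                # quantize before next stage  (step e)
        K.append(k)
        a = fwd + k * bwd                  # step-up recursion           (step f)
    return np.array(K), a

if __name__ == "__main__":
    K, a = covariance_lattice([[2.0, 1.0], [1.0, 2.0]], 1)
    print(K, a)
```

Because F, B, and C are recomputed at each stage from the current (possibly quantized) coefficients, no optimality of earlier stages is assumed, which is the property that permits quantization inside the loop. The double sums are evaluated here as plain matrix products for clarity; the simplifications (33)-(35) would reduce the operation count.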

We  have  used  this  procedure  to  analyze  speech  signals,  with 
the  signal  covariance  estimated  by  (41).  In  general,  the  results 
were  somewhere  between  those  using  the  autocorrelation  and 
covariance  methods.  In  particular,  the  pole  bandwidths  were 
usually  less  than  those  from  the  autocorrelation  method,  but 
greater  than  those  from  the  covariance  method.  In  all  cases 
where  the  covariance  method  gave  unstable  results,  the 
covariance-lattice  method  gave  stable  results. 

While the performance of all linear prediction methods tends
to deteriorate (in terms of spectral accuracy) as the number of
signal samples N is sharply reduced, we believe that the
procedure given above should continue to give better resolution
than the autocorrelation method, and should continue to
guarantee stability, unlike the covariance method, which tends
to become unstable for short durations.


VI.  Conclusions 

This paper presented a class of lattice methods for linear
prediction that guarantees the stability of the all-pole filter, with
or without windowing of the signal, and with FWL
computations. Also, for data-compression purposes, quantization of
the reflection coefficients can be accomplished within the
recursion, if desired, without affecting the stability of the filter.
It was shown that the methods of Itakura and Burg are special
cases of this class of lattice methods.

A procedure was derived to make these lattice methods more
efficient computationally, with a cost comparable to the
traditional autocorrelation and covariance methods. The
procedure, named the covariance-lattice method, computes the
reflection coefficients recursively in terms of the covariance of
the signal and the filter parameters at each stage. When used
with speech signals, this method gave results somewhere in
between the autocorrelation and covariance methods.

SEQUENTIAL  LATTICE  METHODS  FOR  STABLE  LINEAR  PREDICTION

ABSTRACT

A sequential linear prediction method
computes new values for the parameters of
the predictor on a sample-by-sample basis.
Under the assumption of an all-pole (or
autoregressive) model, a number of methods
are developed in this paper for
sequentially estimating the model
parameters. A common thread in all the
developed methods is that they employ the
lattice model of the linear prediction
filter and that they all guarantee the
filter stability. Several applications of
sequential estimation are considered in
speech signal processing. While the paper
contains mainly theoretical developments,
results of experimental investigations of
the reported methods will be presented at
the conference.


1.  INTRODUCTION 

Recently a class of lattice methods
was proposed for linear prediction with
the resulting all-pole filter guaranteed
to be stable [1]. In this paper, we
extend these methods to permit sequential
estimation. A sequential method, by our
definition, provides a new estimate for
the filter coefficients upon receiving
each signal sample. Below we limit our
discussion to the all-pole (or
autoregressive) model, and consider
applications in speech signal processing.

Before we consider sequential linear
prediction methods, we review the lattice
formulation for block linear prediction in
Section 2. (A block or batch-processing
method provides one estimate for the
filter coefficients over a given block of
signal samples.) The types of sequential
methods developed in this paper are
described in Section 3. From an
operational viewpoint, these methods are
grouped into two classes: (1) Block
sequential estimation, and (2) Recursive
estimation. Based on the time extent of
dependence of the present estimate on past
signal samples, sequential methods are
grouped into three classes: (a) Fixed
(finite) memory, (b) Growing memory, and
(c) Fading memory. Section 4 deals with
block sequential estimation and Section 5,
with recursive estimation. Both sections
treat the three memory conditions given
above. Section 6 points out two important
differences between block sequential and
recursive estimation approaches.

Sequential methods require, in
general, increased computation compared to
block methods. There are, however, a
number of potential advantages in having
the filter coefficients available on a
sample-by-sample basis [2.4,18-17]. These
advantages, as applied to speech signal
processing, are considered in Section 7.

2.  LATTICE  FORMULATION  FOR  LINEAR 
PREDICTION 

The lattice formulation was
introduced in speech by Itakura [7], and
in geophysics, by Burg [8]. (Burg's
method is known as the maximum entropy
method.) Recently, Makhoul showed the
existence of a class of such lattice
methods all of which guarantee the
stability of the all-pole filter, with or
without windowing of the signal; also,
stability is less sensitive to finite
wordlength computations [1].
Unfortunately, these methods (hereafter
called regular lattice methods) cause
about a four-fold increase in computation
over the traditional autocorrelation and
covariance methods [9]. To overcome this
drawback, Makhoul introduced the so-called
covariance lattice methods: these compute
the lattice model parameters directly from
the covariance of the signal, and thus
require about the same order of
computational complexity as the two
traditional methods [1]. Since our
purpose is to extend both the regular and
covariance lattice methods to permit
sequential linear prediction, we shall
next explain the lattice model and
introduce the necessary terminology.

In linear prediction, the signal
spectrum is modelled by an all-pole
spectrum with a transfer function given by

$$H(z) = G / A(z), \tag{1}$$

where

$$A(z) = \sum_{k=0}^{p} a_k z^{-k}, \quad a_0 = 1, \tag{2}$$

is known as the inverse filter, G is a
gain factor, $a_k$ are the predictor
coefficients, and p is the number of poles
or predictor coefficients in the model.
If H(z) is stable, A(z) can be implemented
as a lattice filter, as shown in Fig. 1.
The reflection (or partial correlation)
coefficients $K_m$ in the lattice are
uniquely related to the predictor
coefficients. For a stable H(z), one must
have

$$|K_m| < 1, \quad 1 \le m \le p. \tag{3}$$


Fig. 1.  Lattice inverse filter.

In the lattice formulation, the
reflection coefficients can be computed by
minimizing some norm of the forward
residual $f_m(n)$ or the backward residual
$b_m(n)$, or a combination of the two. From
Fig. 1, the following relations hold:

$$f_0(n) = b_0(n) = s(n), \tag{4a}$$

$$f_{m+1}(n) = f_m(n) + K_{m+1}\, b_m(n-1), \tag{4b}$$

$$b_{m+1}(n) = K_{m+1}\, f_m(n) + b_m(n-1). \tag{4c}$$

s(n) is the input signal and $e(n) = f_p(n)$ is
the output residual.

There are a number of methods for
estimating the reflection coefficients
which satisfy the stability condition (3).
Each of these methods may be extended to
perform sequential estimation.

Besides the important stability
consideration, there are other factors
that favor employing the lattice model in
general. Lattice form implementation of
(1) produces a lower sensitivity to
roundoff noise than, for example, the
direct form implementation [10]. The
reflection coefficients, which are the
parameters of the lattice model, were
found to be the best for use in speech
transmission systems [11]. Also, the
reflection coefficients have an
orthogonality property in the sense that
an (m+1)-stage lattice has its first m
reflection coefficients identical to those
of the m-stage lattice. Using this
property and a suitable criterion, an
estimate of the "true" order of the model
for a given signal sequence may be readily
obtained [9,12]. In fact, such an
estimate was employed in the design of
variable order linear prediction as a data
compression technique [6].

3.  TYPES  OF  SEQUENTIAL  ESTIMATION  METHODS

Sequential estimation methods
presented in this paper can be classified
in two different ways, first by
considering the operational aspect of the
estimator, and second based on estimator
memory.

From  an  operational  viewpoint,  we 
have  two  classes  of  sequential  methods: 

(1)  Block  sequential  estimation, 

(2)  Recursive  estimation. 

A block sequential estimator provides
sample-by-sample estimates by successively
applying a block linear prediction method.
Since, for block linear prediction,
covariance lattice methods give the same
results as regular lattice methods, but at
substantial computational savings, we
exclusively consider the use of covariance
lattice methods in block sequential
estimation. A recursive estimator
determines a new estimate at time n as a
function of the last estimate at time n-1
and a quantity that is available at time
n. (This latter quantity may be called a
"measurement" at time n, following the
control theory or Kalman filter
terminology.) Regular lattice methods and
a version of Widrow's least mean squares
method are considered as examples of
recursive estimation.

Based  on  the  nature  of  the  estimator 
memory,  we  group  sequential  methods  into 
three  classes.  By  memory,  we  mean  the 
dependence  (direct  or  indirect)  of  the 
current  estimate  on  past  signal  samples. 
The three classes are:

(a)  Fixed memory methods,

(b)  Growing memory methods,

(c)  Fading memory methods.

The extent of the estimator memory is
constrained to be constant for class (a);
as new signal samples arrive, the
estimator memory is updated such that the
signal samples furthest in the past are
discarded to make room for the most recent
signal samples. For class (b), the size
of the estimator memory increases as new
data is processed. Fading memory methods,
which form class (c), can have either a
fixed or growing memory span, but the most
recent data is given greater emphasis than
the data further back in time.

Section 4 below deals with block
sequential estimation, and Section 5, with
recursive estimation. Both sections treat
methods from each of the three
memory-based classes given above.

4.  BLOCK  SEQUENTIAL  ESTIMATION  (BSE) 

Upon receiving a sample s(n), a BSE
method finds a stable estimate $\{K_m(n)\}$ in
two steps as follows. First, from its
memory span {s(n), s(n-1), ...} (which has
a constant or increasing number of samples
depending upon whether the BSE method has
a fixed or growing memory), it computes
the covariance matrix

$$\Phi(n) = [\phi(i,j,n)], \quad 0 \le i,j \le p, \tag{5}$$

where $\phi(i,j,n)$ is the i-jth covariance at
time n. The second step is to apply any
of the covariance lattice methods given in
[1] to solve for the lattice parameters
$K_m(n)$. We show next that, under each of
the three memory conditions, computing
$\Phi(n)$ can be accomplished at significant
computational savings by making use of the
knowledge of $\Phi(n-1)$. (Of course, at the
very beginning of the signal sequence
where the estimator is just starting up,
the covariance matrix has to be computed
directly from signal samples. In fact, in
that initial period, the first estimate is
available only after a certain number of
signal samples have been accumulated;
this number is equal to the size of the
estimator memory for fixed memory methods,
and equal to p+1 for growing memory
methods.)

A.  Fixed  Memory 

We define

$$\phi(i,j,n) = \sum_{k=n-M+p+1}^{n} s(k-i)\, s(k-j), \quad 0 \le i,j \le p, \tag{6}$$

where M > p is a finite constant. It is
clear from (6) that $\phi(i,j,n) = \phi(j,i,n)$,
i.e., $\Phi(n)$ is symmetric. Since the
definition of $\Phi(n)$ given by (5) and (6)
makes use of the signal samples
s(n), s(n-1), ..., s(n-M+1), the extent of
the estimator memory is M samples. (One
could also view the most recent M-p
samples as representing the estimator
memory, with the other p samples serving
as initial conditions.) By a simple change
of summation variable in (6), with r = k-1,
it is easy to show that

$$\phi(i,j,n) = \phi(i-1, j-1, n-1), \quad 1 \le i,j \le p. \tag{7}$$

That is, the lower p x p submatrix of $\Phi(n)$
is identical to the upper p x p submatrix of
$\Phi(n-1)$. Therefore, only the first row of
$\Phi(n)$ has to be actually computed. (By
symmetry, the first column is identical to
the first row.) It can be easily shown
from (6) that the elements of the first
row, $\phi(0,j,n)$, are given by the recursive
form:

$$\phi(0,j,n) = \phi(0,j,n-1) + s(n)\, s(n-j) - s(n-M+p)\, s(n-M+p-j), \quad 0 \le j \le p. \tag{8}$$


The number of multiplications
required for computing $\Phi(n)$ by (7) and (8)
is only 2(p+1). (Compare this with M(p+1)
multiplications required in the
computation of $\Phi(n)$ directly from signal
samples using (6).) The covariance lattice
method requires on the order of
$(p^3 + 3p^2 - 4p)/2$ multiplications to solve for
the parameters [1]. Therefore, the total
number of multiplications per sample is
about $(p^3 + 3p^2)/2$, most of this
computational load being due to the
covariance lattice solution. In terms of
storage, the described BSE method requires
an M-sample first-in first-out (FIFO)
buffer for storing the most recent signal
samples, and also a storage of size
(p+1)(p+2)/2 for the elements of the
symmetric covariance matrix.

The above approach can be easily
extended to provide a new linear
prediction estimate once every L signal
samples instead of every sample. In other
words, the M-sample analysis interval is
shifted forward in time by L samples to
obtain a new estimate.
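The fixed memory update of (7) and (8) can be checked against the direct definition (6) with a short sketch. The function names `direct_cov` and `update_cov` are hypothetical, and the signal is assumed to be available as an array indexed from 0, with n >= M so that all referenced samples exist.

```python
import numpy as np

def direct_cov(s, n, M, p):
    """Covariance matrix at time n computed directly from the samples, as in (6)."""
    ks = np.arange(n - M + p + 1, n + 1)
    X = np.stack([s[ks - i] for i in range(p + 1)])
    return X @ X.T

def update_cov(phi_prev, s, n, M, p):
    """Covariance matrix at time n obtained from the one at time n-1 (requires n >= M)."""
    phi = np.empty_like(phi_prev)
    phi[1:, 1:] = phi_prev[:-1, :-1]              # shift, as in (7)
    j = np.arange(p + 1)
    old = n - M + p                               # sample leaving the window
    phi[0, :] = (phi_prev[0, :] + s[n] * s[n - j]
                 - s[old] * s[old - j])           # first row, as in (8)
    phi[:, 0] = phi[0, :]                         # first column by symmetry
    return phi

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    s = rng.standard_normal(60)
    M, p = 12, 3
    phi = direct_cov(s, M - 1, M, p)              # start-up: direct computation
    for n in range(M, len(s)):
        phi = update_cov(phi, s, n, M, p)
    print(np.allclose(phi, direct_cov(s, len(s) - 1, M, p)))
```

Only the first row involves arithmetic, 2(p+1) multiplications per sample, in line with the count given above; the rest of the matrix is a shifted copy.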

B.  Growing  Memory 

For the growing memory condition, we
define the covariance as

$$\phi(i,j,n) = \sum_{k=p}^{n} s(k-i)\, s(k-j), \quad 0 \le i,j \le p, \tag{9}$$

where we have assumed that the signal
sequence starts with s(0). The growing
memory aspect of the estimator is obvious
from (9), where the memory length is equal
to n+1.

It is readily seen from (9) that

$$\phi(i,j,n) = \phi(i,j,n-1) + s(n-i)\, s(n-j), \quad 0 \le i,j \le p. \tag{10}$$

Initially,

$$\phi(i,j,p) = s(p-i)\, s(p-j). \tag{11}$$

The recursive computation of the
covariance matrix elements in (10)
requires (p+1)(p+2)/2 multiplications per
signal sample. This, together with the
covariance lattice solution, brings the
total number of multiplications to about
$(p^3 + 4p^2 - p)/2$ for computing a new estimate.
The amount of storage required is p+1 for
storing the most recent samples, and
(p+1)(p+2)/2 for storing the covariance
matrix elements.

C.  Fading Memory

By fading memory, we mean that recent
data is given more emphasis than past
data. This feature of "discounting" past
data can be incorporated into either
finite memory estimators or growing memory
estimators. Since the introduction of
fading in growing memory methods is
straightforward, we consider that case
first.

Growing Memory with Fading: Covariance
computation in (10) may be modified to
permit an exponential weighting of past
data as follows:

$$\phi(i,j,n) = \beta\, \phi(i,j,n-1) + s(n-i)\, s(n-j), \quad 0 \le i,j \le p, \quad 0 \le \beta \le 1. \tag{12}$$

Notice that if $\beta = 1$ (no fading), (12)
becomes identical to (10); if $\beta = 0$
(complete fading), we have fixed memory
estimation (M = p+1 in (6)). With (11)
still giving the initial covariance value,
(12) can be rewritten as

$$\phi(i,j,n) = \sum_{k=p}^{n} \beta^{n-k} s(k-i)\, s(k-j), \quad 0 \le i,j \le p, \tag{13}$$

where the exponential weighting is
explicitly shown.

Fixed Memory with Fading: For this case,
inspection of (13) and (6) suggests a
covariance definition that is the same as
(13) except with the lower limit for the
summation index k being (n-M+p+1) instead
of p. With this definition, (7) still
holds; (8) is modified as follows:

$$\phi(0,j,n) = \beta\, \phi(0,j,n-1) + s(n)\, s(n-j) - \beta^{M-p} s(n-M+p)\, s(n-M+p-j), \quad 0 \le j \le p. \tag{14}$$

5.  RECURSIVE  ESTIMATION

In this section, we first extend the
regular lattice approach to provide
recursive estimation under each of the
three memory conditions. Next, we apply
Widrow's steepest descent least mean
squares method to the lattice model given
in Fig. 1; the resulting estimator has a
growing memory aspect which is different
from that in the regular lattice approach.

A.  Regular Lattice Approach

We shall consider Burg's method as an
illustrative example in our discussion
below. An important property of Burg's
method that is not shared by other lattice
methods, is that the estimate of the
reflection coefficients results directly
from the minimization of an error
criterion [8,1]. The error is defined as
the sum of the mean square values of the
forward and backward residuals.

Referring to the lattice model in
Fig. 1, the memory at the input of stage
m+1 at time n is represented by the
residual sequences $\{f_m(k), b_m(k-1),
k = n, n-1, \ldots\}$. The estimate of $K_{m+1}(n)$ is
determined by minimizing the following
error $E_{m+1}(n)$, which is the sum of forward
and backward residuals at the output of
that stage:

$$E_{m+1}(n) = \sum_{k}^{n} \left[ f_{m+1}^2(k) + b_{m+1}^2(k) \right], \tag{15}$$

where the lower limit for the summation
index is left unspecified to allow the use
of either fixed memory or growing memory.
Substituting (4b) and (4c) into (15), and
equating the partial derivative of $E_{m+1}(n)$
with respect to $K_{m+1}$ to 0, we obtain

$$K_{m+1}(n) = \frac{-2 \sum_{k}^{n} f_m(k)\, b_m(k-1)}{\sum_{k}^{n} \left[ f_m^2(k) + b_m^2(k-1) \right]}. \tag{16}$$

(Notice that with the use of any other
lattice method instead of Burg's method,
expression (18) has to be appropriately
modified.) The result in (16) is used to
compute all the p reflection coefficients
by substituting m = 0, 1, ..., p-1, in that
order. After a reflection coefficient at
stage m is determined, the forward and
backward residuals at the output of that
stage are computed, so that $K_{m+1}$ can then
be obtained using (16). We have chosen to
call the sequential procedure defined by
(4) and (16) recursive estimation
since, as will be shown later, the
expression for $K_{m+1}(n)$ in (16) can be
rewritten as the sum of $K_{m+1}(n-1)$ and a
correction term.

Defining

$$F_m(n) = \sum_{k}^{n} f_m^2(k), \tag{17a}$$

$$B_m(n) = \sum_{k}^{n} b_m^2(k-1), \tag{17b}$$

$$C_m(n) = \sum_{k}^{n} f_m(k)\, b_m(k-1), \tag{17c}$$

(16) becomes

$$K_{m+1}(n) = \frac{-2\, C_m(n)}{F_m(n) + B_m(n)}. \tag{18}$$

(Notice that the sum $F_m(n) + B_m(n)$ could
have been defined as one quantity which is
equal to the sum of the terms on the right
hand sides of (17a) and (17b). This
approach would reduce the storage required
by the estimator, and as such should be
preferred for actual implementation.
However, we shall carry on the two
residual norm squares in this section, as
it allows one to think in terms of the
"physical" signals at various nodes of the
lattice shown in Fig. 1.) The correlations
in (17) can be computed recursively in
time. Below we deal with this and other
issues by considering each of the three
memory conditions separately.

Fixed Memory: For this case, the lower
limit for the summation index k in
(15)-(17) is n-M+p+1, where M is the size
of the estimator memory. With this lower
limit in (17), we obtain

$$F_m(n) = F_m(n-1) + f_m^2(n) - f_m^2(n-M+p), \tag{19a}$$

$$B_m(n) = B_m(n-1) + b_m^2(n-1) - b_m^2(n-M+p-1), \tag{19b}$$

$$C_m(n) = C_m(n-1) + f_m(n)\, b_m(n-1) - f_m(n-M+p)\, b_m(n-M+p-1). \tag{19c}$$

Equations (4), (19) and (18) describe the
sequential estimation method under
consideration. Excluding the case M = p+1
(see below), the total number of
computations per signal sample required by
this method is 8p multiplications and p
divisions if the M most recent samples of
the residuals $f_m$ and $b_m$, $0 \le m \le p-1$, are
stored, or 5p multiplications and p
divisions if the quantities $f_m^2$, $b_m^2$ and
$f_m b_m$ (total = 3p x M) are stored instead.
In both cases, the 3p correlations $F_m$, $B_m$
and $C_m$ have to be stored.

B 

A special case of interest is
obtained with M = p+1. For this case, each
of the summations in (15)-(17) degenerates
into a single term corresponding to k = n.
In particular,

$$K_{m+1}^{s}(n) = \frac{-2\, f_m(n)\, b_m(n-1)}{f_m^2(n) + b_m^2(n-1)}, \tag{20}$$

where the superscript s denotes "single
term". For this case, the estimate of the
(m+1)th reflection coefficient at time n
depends on the input residuals to that
stage at that time only. Therefore, the
updating relations (19) reduce to the
middle term only in each equation, and
hence are not explicitly required. The
sequential procedure, described by (4) and
(20), was suggested by Boll [4]; it
requires 5p multiplications and p
divisions per signal sample.


For the fixed memory condition, and
excluding the degenerate case M = p+1, (16)
can be rewritten in a recursive form as
follows:


(21a) 

(21b) 

n IB 

(21c) 

(21d) 


yi^l(n)  - [2  f^(n-M+p)  b|j^(n-M+p-l) 

-2  f^(n)  bjj^(n-l)  J/(v^^(n) 

(n-M+p)  J . (2Xm) 


The quantity $K_{m+1}^{s}(n)$ that appears in the
correction term in (21a) may be termed a
"measurement" at time n. The gain term
given in (21b) has an inverse relation to
$V_{m+1}$ ($= F_m + B_m$), which is the sum of the norm
squares of the two input signals $f_m(k)$ and
$b_m(k-1)$ defined over the memory span.
Although the correction term in (21a) as
described by (21b)-(21e) seems
complicated, it is a function of only the
quantities $V_{m+1}$ and $K_{m+1}$ at time n-1, and
the input signals to stage m+1 at time
instants n and n-M+p. While the recursive
form (21) is useful for studying some of
the properties of the estimation process,
it should be cautioned that implementing
the sequential procedure in that form is
computationally less efficient than using
(19) and (18).

Since most of the discussions presented for the fixed memory case,
with simple modifications, apply to the growing memory and fading
memory cases, we treat those cases below very briefly.

Growing Memory: With the lower limit for the summation index k in
(15)-(17) equal to p for this case, we obtain

$$ F_m(n) = F_m(n-1) + f_m^2(n), \qquad (22a) $$

$$ B_m(n) = B_m(n-1) + b_m^2(n-1), \qquad (22b) $$

$$ C_m(n) = C_m(n-1) + f_m(n)\, b_m(n-1). \qquad (22c) $$

The  growing  memory  recursive  estimation 
method  is  thus  described  by  (4),  (22)  and 
(18);  it  requires  5p  multiplications  and 
p divisions  per  signal  sample,  and  needs 
to  store  3p  correlations  given  in  (22). 


The growing memory sequential estimation procedure has been used by
Kang [3] and by Srinath and Viswanathan [12]. Neither reference,
however, employs the error criterion (15). Following Itakura, Kang [3]
uses (16) as an approximation to Itakura's PARCOR (partial
correlation) coefficients, while reference [12] makes a stationarity
assumption in deriving (16).

Fading  Memory;  In  a manner  analogous  to 
that  resulting  in  (13),  a reasonable  way 
to  introduce  fading  is  to  weight  the  terms 

in  the  summation  in  (15)  by  OlBil. 
It  is  easy  to  see  that  this  weighting  is 
carried  over  to  (16)  and  (17).  For 
recursively  updating  the  correlations, 
(19)  (or  (22))  can  be  easily  modified  in 
the  same  way  as  we  did  for  obtaining  (12) 
(or  (14)). 

B. Steepest Descent Least Mean Squares (LMS) Approach

Widrow's "noisy" gradient LMS approach [13,14] has been used mainly
for sequential estimation of the predictor coefficients. That method,
however, does not guarantee the stability of the all-pole model.
Recently, Horvath applied Widrow's method to the lattice model of a
pole-zero equalizer filter [15]. In the absence of zeroes, the model
is as shown in Fig. 1. Horvath used the lattice model primarily
because checking the stability of the filter becomes a trivial problem
(see (3)). Briefly, the recursive relations used in that LMS method
are as follows:


For the growing memory condition, (16) can be rewritten in a recursive
form that resembles the Kalman filter equation as follows:

$$ K_{m+1}(n) = K_{m+1}(n-1) + G_{m+1}(n)\,[K^{s}_{m+1}(n) - K_{m+1}(n-1)] \qquad (23a) $$

$$ G_{m+1}(n) = \frac{v_{m+1}(n)}{V_{m+1}(n)} \qquad (23b) $$

$$ V_{m+1}(n) = V_{m+1}(n-1) + v_{m+1}(n), \qquad (23c) $$

where $v_{m+1}$ is defined by (21c). It is interesting to note that
the estimate $K^{s}_{m+1}(n)$ produced by the degenerate case of the
fixed memory estimator (see (20)) appears in the correction term in
(23a) as a "measurement" at time n. Other comments that immediately
follow (21) apply to (23) as well, with the major difference that the
recursive form (23) is much simpler than (21).


$$ K_m(n) = K_m(n-1) - \alpha_m(n) \left. \frac{\partial e^2(n)}{\partial K_m} \right|_{K_m = K_m(n-1)}, \quad 1 \le m \le p, \qquad (24) $$

where e(n) is the output residual $f_p(n)$ in Fig. 1, and
$\alpha_m(n)$ is a step-size parameter that is usually set to a small
constant value, but that in general may be a function of time. The
recursive form (24) is similar to (21) or (23) with the difference
that the correction term is much simpler in (24); it is proportional
to the negative of the instantaneous gradient of $e^2(n)$ with respect
to $K_m$. (The procedure to compute this gradient is given in [15].)

With a fixed (i.e., data independent) step-size parameter sequence
$\{\alpha_m(n)\}$, (24) can lead to filter instability. To overcome
this problem, the step sizes may be changed whenever necessary to
ensure that the updated reflection coefficients satisfy (3). This will
guarantee the filter stability at the expense of altering the nature
of convergence of the LMS method.

From an inspection of (24) it follows that the LMS method has a
growing memory. In view of the different correction terms in (22) and
(24), it is evident that the growing memory aspect of the LMS method
is different from that of the regular lattice method.

6. BLOCK SEQUENTIAL VERSUS RECURSIVE ESTIMATION

Some of the differences between the two estimation approaches were
already stated in previous sections. Here, we emphasize two important
differences.

First,  for  sample-by-sample 
estimation,  block  sequential  methods  are 
computationally  more  expensive  than 
recursive  methods.  However,  if  estimates 
are  not  desired  every  sample,  block 
methods  can  become  more  advantageous. 

Second,  the  reflection  coefficient 
estimates  computed  by  the  block  sequential 
and  recursive  methods  will  be  different, 
in  general.  To  clarify  this  point,  let  us 
explain  the  operational  details  of  the  two 
approaches  as  follows  (although  specific 
implementations  may  not  explicitly  perform 
these  operations).  For  each  new  signal 
sample,  the  block  sequential  approach  uses 

the  signal  samples  in  the  memory  to 
compute  the  estimate  of  the  first 
reflection  coefficient,  passes  them 
through  the  first  stage  in  the  lattice  to 
generate the residuals $f_1(k)$ and $b_1(k)$ for

all  desired  k,  then  computes  the  estimate 
of  the  second  reflection  coefficient  from 
these  residuals,  etc.  On  the  other  hand, 
the  recursive  approach  "ripples”  each  new 
signal  sample,  and  only  that  sample, 
through  the  entire  lattice  to  compute  the 
estimate  of  all  the  reflection 
coefficients.  Thus,  the  previous 
sample-by-sample  (or  instantaneous) 
estimates  of  the  reflection  coefficients 
determine  the  residuals  which  in  turn  are 
used  for  computing  the  current  estimate. 
Therefore,  only  the  estimates  of  the  first 
reflection  coefficient  will  be  the  same 
for  both  approaches;  estimates  of  all 
other  reflection  coefficients  will,  in 
general,  be  different  for  the  two 
approaches.  Notice  that  the  above 
operational  description  also  indicates  why 
block  sequential  methods  are 
computationally  more  expensive  than 
recursive  methods. 

Experimental  comparisons  of  the 
results  from  the  two  estimation  approaches 
will  be  presented  at  the  conference. 


7.  APPLICATIONS  IN  SPEECH  PROCESSING 

Sequential  estimation  has  been  used 
in  a number  of  speech  processing 
applications.  Some  authors  have  dealt 
with  sequential  estimation  of  the 
predictor coefficients $a_k$ in (2) [16-18].

Their  methods  do  not  guarantee  the  filter 
stability,  unlike  the  methods  presented  in 
this  paper.  Below  we  briefly  review  the 
applications  of  sequential  estimation 
methods  in  speech  processing. 

Determination  of  the  instants  at 
which  certain  speech  events  occur,  such  as 
glottal  closure,  may  be  accomplished 
through  fixed  memory  sequential  estimation 
[2].  If  the  ratio  of  mean-squared 
prediction  error  e(n)  to  mean-squared 
signal  s(n)  is  used  as  a measure,  and  if 
the  estimator  memory  is  short  compared 
with  the  pitch  period,  then  the  measure 
will  often  show  sharp  peaks  whenever  the 
time  segment  representing  the  estimator 
memory  contains  a glottal  closure. 

Sequential  estimation  has  been  used 
in pitch extraction schemes by Maksym [16]
and  Boll  [4].  Boll  reported  that  the 
estimation  procedure  given  by  (4)  and  (20) 
produced  a spectrally  flatter  error 
sequence  e(n)  than  block  linear  prediction 
methods  did,  with  the  fundamental 
frequency  more  clearly  evident. 

Next,  we  consider  applications  to 
efficient  speech  transmission  systems.  In 
fixed  frame  rate  systems  using  block 
linear  prediction,  one  set  of  p reflection 
coefficients (p ≈ 12 for 10 kHz signal
sampling  rate)  is  computed  for  every  data 
frame  (typically  20  msec  long),  and 
transmitted  to  the  receiver.  Employing  a 
sequential  estimator  that  is  initialized 
at  the  start  of  a data  frame  and 
terminated  at  the  end  of  the  data  frame, 
one  has  as  many  sets  of  reflection 
coefficient  estimates  as  there  are  speech 
samples  in  the  data  frame.  Transmission 
of  all  of  those  estimates  would 
tremendously  increase  the  bit  rate.  One 
may  select  a best  estimate  in  some  sense 
and  transmit  that  estimate  only.  Kang  [3] 
has  suggested  transmitting  the  estimate 
that  produces  the  minimum  mean  square 
value  for  the  residual  e(n).  Use  of  this 
selection  procedure  with  the  growing 
memory  sequential  estimation  method  given 
by  (4),  (22)  and  (18)  was  found  to  reduce 
the  "wobble"  quality  usually  present  in 
steady  state  regions  of  voiced  sounds  that 
are  synthesized  using  block  linear 
prediction methods [3]. Alternatively, one may select the median of
the estimates or the mode of the probability histogram formed from the
sample-by-sample estimates.
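
As a small illustration of the median alternative, the sketch below
reduces the sample-by-sample estimates of one data frame to a single
set of p coefficients. It assumes that the median is taken separately
for each reflection coefficient, which the text does not specify; the
function name is hypothetical.

```python
from statistics import median


def select_frame_estimate(estimates):
    """Reduce a frame of per-sample estimates (a list of length-p lists)
    to one coefficient set by taking the median of each coefficient."""
    p = len(estimates[0])
    return [median(row[m] for row in estimates) for m in range(p)]
```

Since each selected value is one of (or lies between) the per-sample
estimates, coefficients that individually satisfy the stability
condition keep satisfying it after selection.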
Other applications of sequential estimation include sequentially
adaptive DPCM of speech [1?], and potential use of sample-by-sample
estimates in deciding transmission instances in variable frame rate
systems [5,6].
ADAPTIVE LATTICE METHODS FOR LINEAR PREDICTION

ABSTRACT

A general method for adaptive updating of lattice coefficients in the
linear predictive analysis of nonstationary signals is presented. The
method is given as one of two sequential estimation methods, the other
being a block sequential estimation method. The fast convergence of
adaptive lattice algorithms is seen to be due to the orthogonalization
and decoupling properties of the lattice. These properties are useful
in adaptive Wiener filtering. As an application, a new fast start-up
equalizer structure is presented. In addition, a one-multiplier form
of the lattice is presented, which results in a reduction of
computations.


1.  INTRODUCTION 

The lattice method of linear prediction was first introduced by
Itakura [1,2] for speech analysis. A similar algorithm was proposed
independently by Burg [3] in geophysics. Recently, Makhoul [4] showed
the existence of a class of lattice methods of which the methods of
Itakura and Burg are special cases. All these methods guarantee the
stability of the corresponding all-pole filter, with or without
windowing of the signal, independently of the stationarity properties
and duration of the signal, and with finite wordlength computations.
Also, for data compression purposes, quantization of the reflection
coefficients may be accomplished within the lattice recursion. In
addition, Makhoul [4] developed the so-called covariance-lattice
methods, which compute the lattice model parameters from the
covariance of the signal, with a 3-4 fold saving in computation over
the methods of Itakura and Burg.

The only known disadvantage of lattice methods is that, if the signal
is not windowed, the computed model parameters may not minimize the
output mean-square error, resulting in a suboptimal solution [4]. For
most applications, this disadvantage is of no consequence.

In addition to the advantages given above, the lattice has a most
important orthogonalization property: the "decoupling" of consecutive
stages of the lattice. This property substitutes the global
minimization at the lattice output with a sequence of local
minimization problems, one at each stage of the lattice.


This paper explores the adaptive estimation of lattice parameters in a
nonstationary environment. Earlier related work may be found in [5-7].
Griffiths [7] pointed out that the decoupling property results in a
convergence that is independent of the signal. In this paper we show
how the orthogonalization property may be used to great advantage in
adaptive Wiener filtering. In particular, we present as an application
a new fast start-up adaptive equalizer. Adaptive lattice methods
promise to be useful in areas where transversal, predictive, or finite
impulse response (FIR) filters are used in an adaptive manner.

2.  LATTICE  PRELIMINARIES 

Fig. 1 shows the basic two-multiplier lattice of Itakura and Saito
[1,2]. From Fig. 1, the following relations hold:

$$ f_0(n) = g_0(n) = s(n) \qquad (1a) $$

$$ f_m(n) = f_{m-1}(n) + K_m\, g_{m-1}(n-1) \qquad (1b) $$

$$ g_m(n) = K_m\, f_{m-1}(n) + g_{m-1}(n-1). \qquad (1c) $$

s(n) is the input signal, $f_m(n)$ is the "forward" residual at stage
m, $g_m(n)$ is the "backward" residual, and $K_m$ is the reflection
coefficient. Let the forward transfer function up to stage m be

$$ A_m(z) = \sum_{k=0}^{m} a_m(k)\, z^{-k}, \quad a_m(0) = 1, \qquad (2) $$

where $a_m(k)$ are the predictor coefficients for an mth order
predictor. Then the backward transfer function up to stage m is given
by (2) with the order of the coefficients reversed. The predictor
coefficients are computed from the reflection coefficients using the
recursion

$$ a_m(m) = K_m $$

$$ a_m(k) = a_{m-1}(k) + K_m\, a_{m-1}(m-k), \quad 1 \le k \le m-1. \qquad (3) $$

The stability of the all-pole filter $1/A_p(z)$ is guaranteed iff

$$ |K_m| < 1, \quad 1 \le m \le p. \qquad (4) $$
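
The recursion (3) and the stability test (4) translate directly into
code. The sketch below is illustrative only; it assumes that the
recursion is completed by a_m(m) = K_m together with a_m(0) = 1 from
(2), and the function names are hypothetical.

```python
def reflection_to_predictor(K):
    """Step-up recursion (3): reflection coefficients K_1..K_p to the
    predictor coefficients a_p(0), ..., a_p(p)."""
    a = [1.0]                                   # a_0(0) = 1
    for m, k_m in enumerate(K, start=1):
        prev = a                                # a_{m-1}(0..m-1)
        middle = [prev[k] + k_m * prev[m - k] for k in range(1, m)]
        a = [1.0] + middle + [k_m]              # a_m(0) = 1, a_m(m) = K_m
    return a


def is_stable(K):
    """Condition (4): every reflection coefficient has magnitude below 1."""
    return all(abs(k_m) < 1.0 for k_m in K)
```

For example, K = [0.5, 0.25] gives a_2 = [1, 0.625, 0.25], and checking
stability costs only p comparisons, with no root finding on A_p(z).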
3. SEQUENTIAL ESTIMATION METHODS

A sequential estimation method, by our definition, provides a new
estimate for the reflection coefficients at each time instant n. We
differentiate two types of sequential estimation methods [6]:

(1) Block estimation,

(2) Adaptive estimation.

Block estimation is the usual method of linear prediction analysis,
where one value of each reflection coefficient is estimated for a
whole block of data. The analysis is repeated over again as each
signal sample is added to the block of data. In contrast to block
estimation, adaptive estimation determines a new estimate at time n as
a function of the last estimate at time n-1 and a "measurement" at
time n. Below, we present both types of estimation methods.


In the block method, $K_1(n)$ is computed first, using (7) and the
input signal (see (1a)). Then, the residuals $f_1(k)$ and $g_1(k)$ are
computed using (1) for all time up to n. Then, $K_2(n)$ is computed
from (7), followed by the computation of $f_2(k)$ and $g_2(k)$, and so
on for all stages. The whole process is then repeated at time n+1,
with the residuals having to be completely reevaluated for all time up
to n+1. The amount of computation is clearly large, and so is the
apparent amount of storage. However, one can effect substantial
savings in both by using the covariance lattice method instead [4],
and by recursively updating the signal covariance [6]. It is important
to note that, in the block method, the value of $K_m(n+1)$ does not
depend in any simple way on $K_m(n)$. This is to be contrasted with
the adaptive method described in Section 5.
4. BLOCK ANALYSIS OF NONSTATIONARY SIGNALS

We assume that x(n) is a nonstationary signal, and that we wish to
estimate the reflection coefficients $K_m$ at each instant of time n.
We shall take advantage of the decoupling property of the lattice
(even though it is only approximately true in the nonstationary case),
and determine each $K_m(n)$ by minimizing some function of the forward
and backward residual energies at that stage. Furthermore, since in a
time-varying situation we are mainly interested in the most recent
history of the signal, it is reasonable to weight the residuals such
that the more recent values are given greater importance. We are thus
led to minimizing a mean-square type of error of the form:


$$ E_m(n) = \sum_{k=-\infty}^{n} w(n-k)\, e_m^2(k), \qquad (5) $$

where w(n) is the weighting sequence, or window, and $e_m^2(k)$ is the
residual energy at time k. We shall have more to say about the window
later on. As for the residual energy, we shall consider here only one
case, the sum of the forward and backward residual energies:

$$ e_m^2(k) = f_m^2(k) + g_m^2(k). \qquad (6) $$


We point out at the outset that windowing of the error in (5) is very
different from windowing of the signal. Windowing the signal results
in a stationary signal, while windowing the error does not affect the
stationarity of the signal; it merely weights the different error
values. Signal or data windows may be quite arbitrary, and may take on
positive and negative values. In contrast, the error window in (5)
must always be nonnegative. In particular, we must have

$$ w(n) \ge 0, \quad n \ge 0, $$

$$ w(n) = 0, \quad n < 0. \qquad (8) $$

Negative values are not allowed since they will result in cancellation
of errors, which is generally undesirable.

As examples, we shall give one FIR window and one recursive window.
The FIR window is the usual rectangular window of width M:

$$ w_1(n) = 1, \quad 0 \le n \le M-1, $$

$$ w_1(n) = 0, \quad \text{otherwise}. \qquad (9) $$

This window has some bad effects as a signal window but has good
properties as an error window. The recursive window is the impulse
response of a single real pole:


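Substituting (6) and (1) in (5) and minimizing $E_m(n)$ with respect
to $K_m(n)$ results in:

$$ K_m(n) = - \frac{2 \sum_{k=-\infty}^{n} w(n-k)\, f_{m-1}(k)\, g_{m-1}(k-1)}{\sum_{k=-\infty}^{n} w(n-k)\, [f_{m-1}^2(k) + g_{m-1}^2(k-1)]} \qquad (7a) $$

$$ \phantom{K_m(n)} = - \frac{C_m(n)}{D_m(n)}. \qquad (7b) $$

The value of $K_m$ as given by (7) is always guaranteed to obey (4).
Other possibilities exist for defining $K_m$ such that (4) is
guaranteed [4], but they will not be discussed here.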


$$ w_2(n) = \beta^{n}, \quad n \ge 0, $$

$$ w_2(n) = 0, \quad n < 0. \qquad (10) $$

From (10) and (7), one can compute C(n) and D(n) recursively using

$$ C(k) = \beta\, C(k-1) + 2 f_{m-1}(k)\, g_{m-1}(k-1) \qquad (11a) $$

$$ D(k) = \beta\, D(k-1) + f_{m-1}^2(k) + g_{m-1}^2(k-1) \qquad (11b) $$

for all k up to n. Other recursive windows may be defined, but because
of condition (8), all such windows must be the impulse responses of
lowpass filters with positive real poles.
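
A quick numerical check makes the equivalence between the windowed
sums in (7) and the recursion (11) concrete. The sketch below assumes
the exponential window w_2(n) = β^n, zero initial conditions, and
input sequences that start at k = 0; `f` holds f_{m-1}(k) and
`g_delayed` holds g_{m-1}(k-1). The function names are hypothetical.

```python
def direct_sums(f, g_delayed, beta):
    """C(n) and D(n) from the windowed sums in (7), with w(n) = beta**n."""
    n = len(f) - 1
    C = sum(beta ** (n - k) * 2.0 * f[k] * g_delayed[k] for k in range(n + 1))
    D = sum(beta ** (n - k) * (f[k] ** 2 + g_delayed[k] ** 2) for k in range(n + 1))
    return C, D


def recursive_sums(f, g_delayed, beta):
    """The same quantities from the one-step recursions (11a) and (11b)."""
    C = D = 0.0
    for f_k, g_k in zip(f, g_delayed):
        C = beta * C + 2.0 * f_k * g_k
        D = beta * D + f_k * f_k + g_k * g_k
    return C, D
```

Both routes give the same C and D up to rounding, but the recursive
form needs only the previous C and D rather than the whole history of
residuals.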


5. ADAPTIVE ESTIMATION

In adaptive estimation, we assume given $K_m(n)$, $1 \le m \le p$, at
time n, and the forward and backward residuals up to time n. The
problem is then to estimate $K_m(n+1)$, $1 \le m \le p$, at time n+1
using the given quantities. We shall employ the estimate in (7) but in
a different manner:

$$ K_m(n+1) = - \frac{C_m(n)}{D_m(n)}. \qquad (12) $$

Given $K_m(n)$ and $g_m(n-1)$, $1 \le m \le p$, one computes $f_m(n)$
and $g_m(n)$ for all stages using (1). Then $K_m(n+1)$,
$1 \le m \le p$, are computed from (12), and so on. In contrast with
the block method, the residuals are computed only once for each point
in time.

The windows $w_1(n)$ and $w_2(n)$ may also be used in adaptive
estimation. For example, with $w_2(n)$ one can use (11) with k=n. It
is clear from the recursive computation in (11) that only 6
multiplications (neglecting multiplication by 2) and 1 division are
needed to compute each of the reflection coefficients at each point in
time. In addition, the necessary memory is minimal. The rectangular
window in (9) requires 2 fewer multiplications, but in exchange
requires memory proportional to M, the window width. Therefore, the
main advantage of adaptive estimation over the block method is the
reduced computation, and the reduced storage when using the recursive
window. The price to be paid is that adaptive estimation is noisier;
we view the adaptive estimation method as an approximation to the
block method. Examples illustrating the difference between the two
methods will be given in the conference.
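
The adaptive procedure with the recursive window can be sketched as
follows. This is a minimal illustration rather than a reference
implementation: it assumes the update K_m(n+1) = -C_m(n)/D_m(n), zero
initial residuals and coefficients, and that a coefficient is left
unchanged while D_m(n) is still zero. The function name and the
default value of `beta` are hypothetical.

```python
def adaptive_lattice(signal, p, beta=0.98):
    """Adaptive lattice estimation with the exponential error window.
    Returns, for each sample n, the list [K_1(n+1), ..., K_p(n+1)]."""
    K = [0.0] * p
    C = [0.0] * p
    D = [0.0] * p
    g_prev = [0.0] * p                 # g_{m-1}(n-1) for stages m = 1..p
    history = []
    for s in signal:
        f = g = s                      # (1a): f_0(n) = g_0(n) = s(n)
        g_now = []
        for m in range(p):             # index m holds stage m+1
            g_now.append(g)            # stage input, delayed for the next sample
            f_in, g_in = f, g_prev[m]
            f = f_in + K[m] * g_in     # (1b), using the current K_m(n)
            g = K[m] * f_in + g_in     # (1c)
            C[m] = beta * C[m] + 2.0 * f_in * g_in          # (11a) with k = n
            D[m] = beta * D[m] + f_in * f_in + g_in * g_in  # (11b) with k = n
            if D[m] > 0.0:
                K[m] = -C[m] / D[m]    # estimate used at time n+1
        g_prev = g_now
        history.append(list(K))
    return history
```

As a usage example, feeding the decaying exponential s(n) = 0.9^n with
beta = 1 drives the first coefficient toward -0.9, the value that
whitens that signal; each sample is rippled through the lattice only
once.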

An algorithm using $w_2(n)$ was used by Itakura in his original
hardware realization of the lattice in a speech vocoder system [8]. A
similar vocoder has been designed by Kang [9].


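where

$$ K^{*}_m(n+1) = - \frac{2 f_{m-1}(n)\, g_{m-1}(n-1)}{d_m(n)} \qquad (15) $$

can be viewed as a single "measurement" at time n+1, and $G_m(n+1)$ is
a gain term at n+1 given by:

$$ G_m(n+1) = \frac{d_m(n)}{D_m(n)}, \qquad (16) $$

where

$$ d_m(n) = f_{m-1}^2(n) + g_{m-1}^2(n-1) \qquad (17a) $$

and

$$ D_m(n) = \beta\, D_m(n-1) + d_m(n). \qquad (17b) $$

$d_m(n)$ may be interpreted as the instantaneous residual variance,
while $D_m(n)$ is the total variance.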

6. ONE-MULTIPLIER LATTICE

The two-multiplier lattice in Fig. 1 is only one of many possible
lattice implementations of the all-zero forward and backward
prediction filters. Some of the implementations have a single
multiplier, which would be useful if a smaller number of multiplies is
desired. Fig. 2 shows one such implementation. Others may be found in
[10].

7. A NEW FAST START-UP EQUALIZER

As an application of the adaptive lattice we propose a new fast
start-up equalizer. This will be useful in polling applications where
the initial time for the adaptation process is desired to be as small
as possible. Chang [11] proposed an equalizer structure that reduces
the start-up time drastically. The general form of the structure is
shown in Fig. 3. The tap coefficients $c_i$ are adjusted such that the
mean-square error between y(n) and some reference signal is minimized.
If the filters are selected such that the signals $s_i(n)$ are
orthonormal, then the tap coefficients $c_i$ can be adjusted to their
optimum values in one step [11].


LMS Interpretation

Using $w_2(n)$ and therefore (11), one can show that $K_m(n+1)$ for
this special window may be written as an update on $K_m(n)$:

$$ K_m(n+1) = K_m(n) - \frac{f_{m-1}(n)\, g_m(n) + g_{m-1}(n-1)\, f_m(n)}{D_m(n)}, \qquad (13) $$

where $D_m(n)$ is given recursively by (11b) with k=n. For the special
case $\beta = 1$, $D_m(n)$ increases continuously and the correction
term in (13) tends to zero as n goes to infinity. In this case $K_m$
tends to its optimal value with probability 1, assuming a stationary
signal. For $\beta < 1$, one can show that (13) becomes identical to
the LMS lattice estimate of Griffiths [7] with a step size
$\alpha = 1 - \beta$.


Kalman Filter Interpretation

One can show [6] that (13) can be rewritten in the form of a Kalman
filter:

$$ K_m(n+1) = K_m(n) + G_m(n+1)\,[K^{*}_m(n+1) - K_m(n)] \qquad (14) $$
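
The three ways of writing the update can be checked against each other
numerically for a single stage. The sketch below is an illustration
under assumptions: the direct form is K_m(n+1) = -C_m(n)/D_m(n) with C
and D updated by (11); the LMS-like form subtracts
[f_{m-1}(n) g_m(n) + g_{m-1}(n-1) f_m(n)] / D_m(n); and the Kalman-like
form uses the measurement K* = -2 f_{m-1}(n) g_{m-1}(n-1) / d_m(n) with
gain d_m(n)/D_m(n). The function names are hypothetical.

```python
def update_direct(C_prev, D_prev, f_in, g_in, beta):
    """K(n+1) = -C(n)/D(n), with C and D updated by the recursion (11)."""
    C = beta * C_prev + 2.0 * f_in * g_in
    D = beta * D_prev + f_in * f_in + g_in * g_in
    return -C / D, C, D


def update_lms_form(K, D, f_in, g_in):
    """Same update written as a correction built from the stage outputs."""
    f_out = f_in + K * g_in            # (1b)
    g_out = K * f_in + g_in            # (1c)
    return K - (f_in * g_out + g_in * f_out) / D


def update_kalman_form(K, D, f_in, g_in):
    """Same update written as gain times (measurement minus old estimate)."""
    d = f_in * f_in + g_in * g_in      # instantaneous residual variance
    K_star = -2.0 * f_in * g_in / d    # single-sample "measurement"
    gain = d / D
    return K + gain * (K_star - K)
```

Starting from any C(n-1), D(n-1) with K(n) = -C(n-1)/D(n-1), the three
functions return the same K(n+1) up to rounding, which is the content
of the LMS and Kalman interpretations.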


In the specific equalizer proposed by Chang, the filter signals
$s_i(n)$ are obtained from x(n) by a linear transformation
$\mathbf{s} = L \mathbf{x}$, where
$\mathbf{s} = [s_1(n) \ldots s_N(n)]^T$,
$\mathbf{x} = [x(n) \ldots x(n-N+1)]^T$, and L is an N×N
transformation matrix that obeys

$$ L R L^T = I, \qquad (18) $$

where R is the N×N autocorrelation matrix of the signal x(n). The
signal x(n) is taken here to be the impulse response of the channel.
The solution chosen for L by Chang is $L = \Lambda^{-1/2} \Phi^T$,
where $\Phi$ is a matrix whose columns are the orthonormal
eigenvectors of R, and $\Lambda$ is a diagonal matrix whose elements
are the eigenvalues of R. However, the use of L requires $N^2$
coefficients with an equal number of multiplies, which can become
excessive for large N. Below, we give our lattice structure for the
fast start-up equalizer, where the number of coefficients is only a
linear function of N.

Instead of an eigenvector decomposition for R, one can perform a
decomposition in which L is a lower triangular matrix, using the
Gram-Schmidt


9. G.S. Kang, personal communication. (See G.S. Kang, "Linear
Predictive Narrowband Voice Digitizer," Proc. 1974 EASCON Conf.,
Washington, D.C., pp. 51-56, Oct. 1974, for a description of the
hardware system.)

10. J. Makhoul, "A Class of All-Zero Lattice Digital Filters:
Properties and Applications," submitted to IEEE Trans. Acoustics,
Speech and Signal Processing, Sept. 1977.

11.  R.W.  Chang,  "A  New  Equalizer  Structure  for  Fast 
Start-up  Digital  Communication,"  Bell  Syst. 
Tech.  J.,  pp.  1969-2014,  July-Aug.  1971. 


orthogonalization process. The transformation Lx turns out to be the
sequence of backward residuals $g_m(n)$, which are orthogonal to each
other. In particular, we have [10]


1 m-1  » 

for  the  two-multiplier  lattice  of  Fig.  1 , or 

1-K. 

for the one-multiplier lattice of Fig. 2. The final equalizer
structure is shown in Fig. 4. The total number of extra coefficients
employed is 3(N-1) for the two-multiplier lattice or 2(N-1) for the
one-multiplier lattice. The lattice is adapted to the channel first,
as described in Section 5, and the tap coefficients should then adapt
in one step.


Fig. 1 Basic all-zero lattice filter

Fig. 3 Generalized equalizer structure

Fig. 4 Lattice fast start-up equalizer structure
The spectral distortion of speech signals, without affecting the pitch
or the speed of the signal, has met with some difficulty due to the
need for pitch extraction. This paper presents a general
analysis-synthesis scheme for the arbitrary spectral distortion of
speech signals without the need for pitch extraction. Linear
predictive warping, cepstral warping, and autocorrelation warping are
given as examples of the general scheme. Applications include the
unscrambling of helium speech, spectral compression for the hard of
hearing, bit rate reduction in speech compression systems, and
efficiency of spectral representation for speech recognition systems.

1. Introduction

Arbitrary spectral distortion of any finite sampled signal can be
easily accomplished by computing the discrete Fourier transform (DFT)
of the signal, performing the desired spectral distortion, and then
taking the inverse DFT. (The resulting signal is an approximation to
the desired spectrally distorted signal in the same measure as the DFT
is an approximation to the z transform. Arbitrary accuracy can be
achieved by increasing the order of the DFT.) In applying this method
to the spectral distortion of voiced speech signals, the spectral
envelope is distorted as well as the voicing (pitch) characteristics.
For many applications, the distortion is usually desired for the
spectral envelope, but not for the pitch. Thus it becomes necessary to
separate the pitch (source) information, distort the spectral
envelope, and then resynthesize using the extracted source
information.

Certain existing research systems [1-3] for the nonlinear spectral
distortion of speech signals separate the source information by making
voiced/unvoiced decisions and performing pitch extraction. A different
approach was taken by Suzuki et al. [4] for the unscrambling of helium
speech, where pitch extraction was not used. In their work, the source
information was obtained as the residual signal in a linear predictive
analysis of the speech signal. The spectral distortion was performed
in the time domain on the impulse response of the all-pole filter.
However, the only type of distortion attempted was a linear one, and
it was effected by interpolation in the time domain. In this paper we
describe a general analysis-synthesis system for the nonlinear
spectral distortion of speech signals, without the need for pitch
extraction. The generality of the system is achieved by performing the
spectral distortion directly in the frequency domain. Three methods,
linear predictive warping, cepstral warping, and autocorrelation
warping, are given as examples of the general scheme.


2. General

The general analysis-synthesis system for spectral distortion is shown
in Fig. 1. The speech signal s(n) is passed through a filter whose
magnitude frequency response is the inverse of the envelope of the
signal spectrum. The output of the inverse filter is the residual
signal e(n), which contains mainly the source information. Since all
the resonant structure of the signal is removed by the inverse filter,
e(n) will have an essentially flat spectral envelope. The residual
signal is then used as input to a synthesis filter whose magnitude
frequency response is equal to the desired distorted or warped
spectral characteristics. The output of the synthesis filter, s'(n),
is then the transformed signal with the same source characteristics as
s(n), but with a spectrum that is a distorted version of the spectrum
of s(n).


One important property of the system in Fig. 1 is that the sampling
rate remains fixed throughout the system. If for some reason the
sampling rate at the output is desired to be different from that at
the input, then one needs to perform down sampling on the residual
signal, or else perform pitch extraction.

There  remains  the  specification  of  the 
inverse  filter  and  synthesis  filter 
parameters.  This  is  described  next. 

3. Nonparametric Warping

The dashed box in Fig. 1 shows the general scheme for spectral warping
and for the specification of the inverse and synthesis filter
parameters. A more detailed block diagram is shown in Fig. 2. The
spectrum $P(\omega)$ of the signal s(n) is computed by windowing the
signal and taking the magnitude squared of its Fourier transform.
$P(\omega)$ is then smoothed to retain the requisite resonant
structure. The smoothed spectrum $\hat{P}(\omega)$ is then inverted.
The resulting inverse smoothed spectrum is then used to determine the
impulse response a(n) of the inverse filter A(z). Assuming a minimum
phase implementation, a(n) can be computed from
$\hat{P}^{-1}(\omega)$ through the use of the cepstrum. Details can be
found in Oppenheim and Schafer [5].

The impulse response h(n) of the synthesis filter H(z) can be computed
using the lower branch in Fig. 2. The signal spectrum is distorted
then smoothed to obtain $\hat{P}'(\omega)$. Again, assuming a minimum
phase implementation, h(n) can be computed from $\hat{P}'(\omega)$
using the cepstral method. An alternative method to compute
$\hat{P}'(\omega)$ is shown by the dashed line in Fig. 2, where the
smoothed spectrum $\hat{P}(\omega)$ is directly distorted. Note that
the two alternative methods do not result in identical spectra for
$\hat{P}'(\omega)$. Which method to use depends on the particular
application.
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Since  the  method  to  compute  the  miniraura 
phase  Impulse  response  from  a spectrum 
Involves  taking  the  DFT,  it  is  desirable  for 
efficiency  purposes  to  have  the  frequency 
values  in  the  spectrum  be  equally  spaced  and 
their  number  be  an  integral  power  of  2,  so 
that  one  can  make  use  of  the  FFT.  If  P(u)  is 
computed  using  the  FFT,  then  the  two 
conditions  can  be  easily  met  for  P"*(u)). 
However,  because  of  the  spectral  distortion  in 
the  lower  branch  of  Fig.  2,  the  spectral 
values  of  P'(“)  will  not  be  equally  spaced  in 
general.  By  simple  Interpolation  in  the 
frequency  domain,  the  spectral  values  can  be 
computed  at  equally  spaced  frequencies,  thus 
opening  the  way  to  the  use  of  the  FFT. 
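The interpolation step can be sketched as follows. This is a minimal example, assuming linear interpolation and a monotonically increasing warping function supplied by the caller; the names `warp_spectrum` and `full_circle` and the particular warp used in the usage lines are illustrative only.

```python
import numpy as np

def warp_spectrum(spectrum, warp):
    """Resample a spectrum onto equally spaced warped frequencies.

    spectrum: samples of P(w) at M equally spaced w in [0, pi].
    warp: monotonically increasing map of [0, pi] onto [0, pi].
    Returns P' at the same M equally spaced frequencies, where
    P'(warp(w)) = P(w).
    """
    spectrum = np.asarray(spectrum, dtype=float)
    grid = np.linspace(0.0, np.pi, len(spectrum))
    # The known values sit at the unequally spaced points warp(grid);
    # linear interpolation brings them back onto the uniform grid.
    return np.interp(grid, warp(grid), spectrum)

def full_circle(half):
    """Mirror a half spectrum (0..pi) into the N = 2*(M-1) points
    that an FFT expects."""
    return np.concatenate([half, half[-2:0:-1]])

# Usage with a hypothetical warp that compresses high frequencies.
half = 2.0 + np.cos(np.linspace(0.0, np.pi, 257))
warped = warp_spectrum(half, lambda w: np.pi * np.sin(w / 2.0))
spectrum_for_fft = full_circle(warped)  # 512 points, a power of 2
```

With M = 257 points on the half circle, the mirrored spectrum has 512 points, which satisfies both conditions stated above.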

The smoothing of the signal spectrum to
obtain the spectral envelope can be done in
many different ways. Here, we give two
popular nonparametric methods which comprise
two of the three methods of spectral warping
that are presented in the paper. A parametric
method is given in the next section.

Autocorrelation Warping

In this method the spectrum is smoothed
by applying a window to the autocorrelation.
This is the well-known method of spectral
estimation used by statisticians [6].

Cepstral Warping

In  this  method  the  log  spectrum  is 
smoothed  by  applying  a window  to  the  cepstrum. 
This  method  of  spectral  smoothing  has  been 
used  extensively  in  speech  analysis  [5]. 
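Both smoothing methods can be sketched in a few lines. The window shapes below (a triangular lag window and a rectangular cepstral window) are assumptions chosen for illustration; the text does not specify which windows were used.

```python
import numpy as np

def circular_lag(n):
    """Absolute lag of each index of a length-n circular sequence."""
    idx = np.arange(n)
    return np.minimum(idx, n - idx)

def smooth_autocorrelation(spectrum, max_lag):
    """Smooth a full-circle power spectrum by windowing its autocorrelation."""
    r = np.fft.ifft(spectrum).real
    # Triangular lag window (assumed); zero beyond max_lag.
    window = np.clip(1.0 - circular_lag(len(r)) / max_lag, 0.0, None)
    return np.fft.fft(r * window).real

def smooth_cepstral(spectrum, max_quefrency):
    """Smooth the log spectrum by windowing the cepstrum."""
    cep = np.fft.ifft(np.log(spectrum)).real
    # Rectangular cepstral window (assumed).
    window = (circular_lag(len(cep)) <= max_quefrency).astype(float)
    return np.exp(np.fft.fft(cep * window).real)

# Usage on a 64-point toy spectrum.
w = 2.0 * np.pi * np.arange(64) / 64
smoothed_a = smooth_autocorrelation(2.0 + np.cos(w), max_lag=4)
smoothed_c = smooth_cepstral(np.exp(2.0 * np.cos(w)), max_quefrency=1)
```

A shorter window gives a smoother envelope in both cases; the cepstral version requires a strictly positive spectrum because it takes a logarithm.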

4. Linear Predictive Warping

We have called the types of spectral
warping in the previous section
"nonparametric" because no specific model is
used to determine the impulse response of the
inverse and synthesis filters. In this
section we use the all-pole linear prediction
model as a basis to determine the parameters
of the two filters. Fig. 3 shows a schematic
diagram of linear predictive (LP) warping.
The parameters $a(k)$ of the inverse filter are
simply the predictor coefficients which are
obtained by spectral LP [7] as a solution to
the set of linear equations:

$$\sum_{k=1}^{p} a(k)\, R(i-k) = -R(i), \qquad 1 \le i \le p, \qquad (1)$$


where $p$ is the number of poles in the model,
and $R(i)$ is the autocorrelation of the signal,
which can be computed either by taking the FFT
of the spectrum $P(\omega)$, or directly from the
signal. Note that the method of spectral LP
inherently smoothes the signal spectrum, with
the degree of smoothing being controlled by
the number of poles $p$. Referring to Fig. 2,
the smoothed spectrum $\hat{P}(\omega)$ in this
case is given by the all-pole model spectrum,
with $G$ the gain of the model:

$$\hat{P}(\omega) = \frac{G^2}{\left| 1 + \sum_{k=1}^{p} a(k)\, e^{-jk\omega} \right|^2}. \qquad (2)$$


The parameters $a'(k)$ of the synthesis
filter are obtained as a solution to a set of
equations analogous to (1) with $a(k)$, $R(i)$ and
$p$ replaced by $a'(k)$, $R'(i)$ and $q$,
respectively, where $R'(i)$ is the Fourier
transform of the distorted spectrum $P'(\omega)$,
and $q$ is the number of poles in the synthesis
filter. In general, $q \neq p$, and its choice
depends on the application.
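A minimal sketch of spectral LP as described by (1) and (2) is given below. It solves the normal equations with a general linear solver for clarity; the function names and the two-pole test model are assumptions made for this example.

```python
import numpy as np

def autocorrelation_from_spectrum(spectrum):
    """R(i) from N full-circle samples of a power spectrum."""
    return np.fft.ifft(spectrum).real

def solve_normal_equations(r, p):
    """Solve sum_k a(k) R(i-k) = -R(i), 1 <= i <= p, for a(1..p).
    Also returns the squared gain G^2 = R(0) + sum_k a(k) R(k)."""
    r = np.asarray(r, dtype=float)
    idx = np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
    a = np.linalg.solve(r[idx], -r[1:p + 1])   # Toeplitz matrix R(|i-k|)
    return a, r[0] + a @ r[1:p + 1]

def all_pole_spectrum(a, gain_sq, n_points):
    """Equation (2) at n_points equally spaced frequencies."""
    inverse_filter = np.concatenate(([1.0], a))
    return gain_sq / np.abs(np.fft.fft(inverse_filter, n_points)) ** 2

# Usage: a hypothetical two-pole model is recovered from its own spectrum.
a_true = np.array([-1.2, 0.8])
spectrum = all_pole_spectrum(a_true, 1.0, 1024)
a_est, gain_sq = solve_normal_equations(
    autocorrelation_from_spectrum(spectrum), p=2)
```

The same two functions apply unchanged to the distorted spectrum, with q in place of p, to obtain the synthesis filter parameters.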

The parameters $a(k)$ need not be computed
using spectral LP, which is essentially
equivalent to the autocorrelation method of
LP. Instead, one could use the covariance,
lattice or covariance lattice methods [8]. In
that case, $P(\omega)$ is undefined. So following
the dashed line in Fig. 2, we compute
$\hat{P}(\omega)$ from (2), distort it, then apply
spectral LP to the resulting distorted spectrum
in order to evaluate the coefficients $a'(k)$ of
the synthesis filter.

5.  Applications 

There  are  many  possible  applications  for 
the  methods  of  nonlinear  spectral  warping 
given  above.  Below,  we  shall  give  four 
applications:  two  of  these  use  the  spectral 
warping  for  a more  efficient  representation  of 
the  spectrum,  and  two  are  analysis-synthesis 
systems  for  generating  speech  that  is 
spectrally  distorted. 

Efficiency of Spectral Representation

In applications such as speech
recognition and speech compression, it is more
important to represent the spectrum accurately
at low frequencies (<3 kHz) than at high
frequencies (>3 kHz). Normally, between 17
and 20 poles are needed for an all-pole
LP representation of speech spectra with a
bandwidth of 7.5 kHz (sampling frequency
15 kHz). Using LP warping, for example, with
frequencies above 3 kHz being heavily warped,
one could have a good representation using
only 12-14 poles. In this manner, one could
still perform accurate formant extraction for
the first three formants, with the higher
formants being represented by wide spectral
peaks, which is all that is usually needed
[9].

For  speech  compression,  this  enables  one 
to  have  wide-band,  high  quality  speech  at  low 
bit  rates,  since  fewer  coefficients  need  to  be 
transmitted.  This  idea  has  been  recently 
implemented  in  an  LPCW  vocoder:  an  LPC  vocoder 
with  spectral  warping  [10]. 

Unscrambling of Helium Speech

In  order  to  render  speech  spoken  in  a 
helium-oxygen mixture more intelligible, it is
necessary  to  compress  the  bandwidth  from  about 
12  kHz  down  to  5 kHz.  In  addition  to  this 
linear  warping,  one  might  need  to  perform 
additional  nonlinear  warping  at  low 
frequencies  to  compensate  for  high  pressure 
effects  [1,2,4].  Heretofore,  such  nonlinear 
warping  had  not  been  possible. 

Since the bandwidth is reduced to 5 kHz,
one must still define values for the spectrum
between 5 and 12 kHz (assuming a 24 kHz
sampling frequency). The reason is that in
our analysis-synthesis system the sampling
rate remains fixed. It is usually sufficient
to assign a positive constant for the spectrum
between 5 and 12 kHz that is a fixed number of
decibels below the maximum value in the
spectrum. A value of zero, however, is not
recommended.
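Filling the unused band can be sketched as below. The 40 dB floor, the function name, and the choice to measure the maximum over the retained band are assumptions for illustration; the text only says "a fixed number of decibels". The text does not say why zero is discouraged; one plausible reason is that the cepstral method takes the logarithm of the spectrum, which is undefined at zero.

```python
import numpy as np

def fill_above_cutoff(half_spectrum, fs, cutoff_hz, floor_db=40.0):
    """Replace spectral values above cutoff_hz by a positive constant.

    half_spectrum: power spectrum at equally spaced frequencies from
    0 to fs/2. floor_db (assumed value) is how far below the in-band
    maximum the constant is placed.
    """
    p = np.array(half_spectrum, dtype=float)
    freqs = np.linspace(0.0, fs / 2.0, len(p))
    in_band = freqs <= cutoff_hz
    floor = p[in_band].max() * 10.0 ** (-floor_db / 10.0)
    p[~in_band] = floor
    return p

# Usage: 24 kHz sampling, compressed speech occupying 0 to 5 kHz.
toy = np.linspace(1.0, 100.0, 13)   # samples at 0, 1, ..., 12 kHz
filled = fill_above_cutoff(toy, fs=24000.0, cutoff_hz=5000.0)
```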

Speech for the Hard of Hearing

Many people with severe hearing loss
cannot hear frequencies much above 1 kHz [11].
An idea that some researchers have had is to
compress the speech spectrum so that the most
important part of the spectrum (up to 3 kHz)
is compressed down to less than 1 kHz. It is
hoped that this squeeze of the spectral
information into a small bandwidth would aid
the hard of hearing in listening to speech,
and would eventually lead to the design of
more effective hearing aids. It is easy to
show that speech produced by a simple linear
compression of the spectrum to less than 1 kHz
is quite unintelligible. However, the results
improve dramatically if a nonlinear warping
that emphasizes low frequencies is effected.

The  technical  details  for  this 
application  are  very  similar  to  those 
described  above  for  the  unscrambling  of  helium 
speech.

6.  Conclusion 

A general analysis-synthesis system for
the nonlinear spectral distortion of speech
signals was described. The method does not
need any pitch extraction, and allows for the
arbitrary specification of the warping
function. The latter is accomplished by
performing the warping directly in the
frequency domain. Depending on the type of
spectral smoothing used, three methods
resulted: autocorrelation, cepstral and linear
predictive warping. Applications for these
methods included bit rate reduction in high
quality speech compression systems, efficient
spectral representation for use in speech
recognition systems, unscrambling of helium
speech, and spectral compression for the hard
of hearing.

In ordinary linear prediction the speech
spectral envelope is modeled by an all-pole
spectrum. The error criterion employed
guarantees a uniform fit across the whole
frequency range. However, we know from speech
perception studies that low frequencies are
more important than high frequencies for
perception. Therefore, a minimally redundant
model would strive to achieve a uniform
perceptual fit across the spectrum, which
means that it should be able to represent low
frequencies more accurately than high
frequencies. This is achieved in the LPCW
vocoder: an LPC vocoder employing our recently
developed method of linear predictive warping
(LPW). The result is improved speech quality
for the same bit rate.

1.  Introduction 

Narrow-band LPC vocoders with
transmission rates less than 4800 bps have
generally dealt with speech sampled at less
than 10 kHz and usually closer to 6.5 kHz.
Since the bit rate needed for transmission is
roughly proportional to the sampling rate, it
is argued justifiably that the possible
increase in speech intelligibility and quality
in going to 10 kHz is not commensurate with
the increase in bit rate, and so sampling
rates closer to 6.5 kHz have dominated the
vocoder scene. The argument can also be
phrased another way. If the bit rate is to
remain fixed (e.g., 2400 bps), then
increasing the number of bits for each frame
means that one is forced to transmit fewer
frames per second. Thus, while spectral
fidelity is increased for each transmitted
frame, the accuracy in following the dynamic
aspects of the signal is decreased.

Traditional  channel  vocoder  systems  have 
solved  this  problem  by  positioning  their 
filters  nonlinearly  such  that  more  filters  are 
at  low  frequencies  than  at  high  frequencies 
[1].  It  is  not  unusual  to  see  a filter  placed 
as  high  as  7 kHz  in  a channel  vocoder.  Thus, 
the  total  speech  bandwidth  represented  can  be 
about  7 kHz,  which  is  to  be  contrasted  with 
bandwidths closer to 3 kHz in LPC vocoders.
(It should not be concluded from this, though,
that channel vocoders produce higher quality
speech than LPC vocoders for a given bit
rate.)

A hybrid solution was introduced in the
TRIVOC vocoder [2], which used an LPC
representation  at  low  frequencies  and  a 
channel  vocoder  at  higher  frequencies.  This, 
of  course,  has  the  disadvantage  of  having  to 
program  two  different  vocoder  systems. 

This paper presents LPCW: an LPC vocoder
that is capable of representing low
frequencies better than high frequencies.
This suggests the possibility of wide-band
speech at low bit rates.


2. Linear Predictive Warping

The  idea  behind  LPCW  is  quite  simple: 
Warp  the  spectrum  such  that  high  frequencies 
are  compressed  relative  to  low  frequencies, 
then  apply  spectral  linear  prediction  [3]  to 
the  warped  spectrum.  Because  the  resulting 
representation  is  uniform  across  the  warped 
spectrum,  it  means  that  low  frequencies  are 
better  matched  than  higher  frequencies  since 
the  latter  are  compressed. 

The  procedure  for  computing  the 
coefficients  of  the  warped  spectrum  is  as 
follows:

(a)  Window  the  signal  and  compute  its 
spectrum. 

(b)  Warp  the  spectrum  as  desired. 

(c)  Take  the  Fourier  transform  of  the  warped 
spectrum to get the autocorrelation $R(i)$.

(d) Solve for the predictor parameters from
the normal equations:

$$\sum_{k=1}^{p} a(k)\, R(i-k) = -R(i), \qquad 1 \le i \le p, \qquad (1)$$

where  a(k)  are  the  predictor  coefficients 
and  p is  their  number.  The  reflection 
coefficients,  which  are  obtained  as  a 
byproduct  of  the  solution,  can  be 
converted  to  log  area  ratios,  then 
quantized  and  transmitted  [4]. 
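Steps (a) through (d) can be sketched as follows. This is an illustrative outline, not the authors' implementation: the Hamming window, the FFT size, the function names, and the log area ratio sign convention $\log[(1+k)/(1-k)]$ are all assumptions, and quantization is omitted.

```python
import numpy as np

def levinson(r, p):
    """Levinson-Durbin recursion for the normal equations (1).
    Returns a(1..p), reflection coefficients k(1..p), residual energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    k = np.zeros(p)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / err
        prev = a[:i].copy()
        a[1:i + 1] += k[i - 1] * prev[::-1]   # order-update of predictor
        err *= 1.0 - k[i - 1] ** 2
    return a[1:], k, err

def lpcw_analysis(frame, warp, p, n_fft=512):
    """warp: increasing map of [0, pi] onto [0, pi] (caller supplied)."""
    windowed = frame * np.hamming(len(frame))            # (a) window
    half = np.abs(np.fft.rfft(windowed, n_fft)) ** 2     # (a) spectrum
    grid = np.linspace(0.0, np.pi, len(half))
    warped = np.interp(grid, warp(grid), half)           # (b) warp
    r = np.fft.irfft(warped)                             # (c) autocorrelation
    a, k, _ = levinson(r, p)                             # (d) solve (1)
    lar = np.log((1.0 + k) / (1.0 - k))                  # log area ratios
    return a, k, lar

# Usage on a synthetic frame with a hypothetical sine-shaped warp.
rng = np.random.default_rng(0)
frame = rng.standard_normal(200)
a, k, lar = lpcw_analysis(frame, lambda w: np.pi * np.sin(w / 2.0), p=9)
```

Because the warped spectrum is nonnegative, the reflection coefficients stay inside the unit interval and the log area ratios are finite.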

In warping the spectrum, it is practical
(because of FFT algorithms) to compute the
spectral values at equally spaced frequencies.
This can be done by simple interpolation from
the signal spectral values. The
autocorrelation $R(i)$ can then be computed via
the FFT.

The  procedure  given  above  for  linear 
predictive  warping  makes  use  of  the 
autocorrelation  method  of  linear  prediction 
[5].  If  the  analysis  is  done  using  the 
covariance,  lattice,  or  covariance  lattice  [6] 
methods,  then  the  procedure  has  to  be  modified 
as  follows:  after  solving  for  the  predictor 
coefficients,  compute  the  all-pole  model 
spectrum,  then  continue  the  procedure  starting 
at  step  (b)  above.  The  all-pole  model 
spectrum  is  given  by: 

$$\hat{P}(\omega) = \frac{G^2}{\left| 1 + \sum_{k=1}^{p} a(k)\, e^{-jk\omega} \right|^2}. \qquad (2)$$

3. Spectral Dewarping

At  the  receiver  of  the  LPCW  vocoder,  the 
received  parameters  are  decoded.  If  log  area 
ratios  are  received,  they  are  decoded  into 
reflection  coefficients,  which  are  converted 
in  turn  to  the  corresponding  predictor 
coefficients  using  a simple  recursive 
procedure [5]. These coefficients correspond
to the warped spectrum and, therefore, cannot
be used for synthesis. One must first perform
the necessary dewarping.

The dewarping procedure is as follows:

(a) Using the decoded predictor coefficients,
compute the all-pole model spectrum from
(2).

(b) Dewarp this spectrum using the inverse of
the function used in the original warping.

(c) Take the Fourier transform of the dewarped
spectrum to obtain the corresponding
autocorrelation function.

(d) Use this autocorrelation function in (1)
to compute the predictor coefficients (and
hence the reflection coefficients)
corresponding to the dewarped spectrum.
The number of poles (predictor
coefficients) here can be as large as
desired to approximate the dewarped
spectrum.

(e) Synthesize the speech waveform using these
computed coefficients.

After step (b) above it is possible to
take a different route to obtain the
parameters of the synthesis filter. Instead
of using linear prediction, one could use the
cepstrum [7] to compute the minimum phase
impulse response whose spectrum is identical
to the dewarped spectrum. This impulse
response is then used for the synthesis
filter.

Discussion

It is clear from the dewarping procedures
given above that the amount of processing
needed at the synthesizer is comparable to
that of the analysis. This increase in
computation relative to a regular LPC vocoder
is certainly a disadvantage. Whether the
extra expense is justified or not depends on
the benefits achieved. For a given bit rate,
the main benefit is an increase in the
representable speech bandwidth. This increase
in bandwidth is on the order of 50%.

4. Warping Functions

Since linear predictive (LP) warping
allows for arbitrary warping of the spectrum,
one must choose a warping function appropriate
for vocoder purposes. One reasonable warping
function would transform the linear frequency
scale to the mel scale [8], which compresses
high frequencies relative to low frequencies.

The relation between the mel scale and
frequency is shown in Fig. 1, which shows how
subjective pitch (in mels) is related to
frequency (in Hz) for pure tones up to 5 kHz.
This relation is similar to those of critical
band masking effects and equal intelligibility
curves [8]. The mel-frequency relation can be
approximated by the following equation

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \qquad (3)$$

where $f$ is the frequency in Hz and $m$ is the
pitch in mels. The mel scale is adjusted such
that $m = 1000$ mels corresponds to $f = 1000$ Hz.

Since, in our application, spectra are
defined in the z plane, we need a warping
function on the angle (which corresponds to
frequency) in the z plane. This implies that
we must assume a particular sampling frequency
$F$. Let

$\omega = 2\pi f / F$ = original angle in the z plane
corresponding to frequency $f$,

$\Omega$ = warped angle corresponding to $f$.

The warping function $\Omega$ is obtained from (3) by
setting $\Omega = \pi$ for $\omega = \pi$ or $f = F/2$, half the
sampling frequency. The result is:

$$\Omega = \pi\, \frac{\log_{10}(1 + f/700)}{\log_{10}(1 + F/1400)}, \qquad 0 \le f \le \frac{F}{2}. \qquad (4)$$


Note that the warping function in (4) is
defined only up to $f = F/2$. For $F/2 \le f \le F$, the
function is taken to be the mirror image about
the real axis. The mel warping function is
plotted in Fig. 2 for a sampling frequency
$F = 10$ kHz, which corresponds to a speech
bandwidth of 5 kHz.
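The mel warping function can be evaluated numerically as in the sketch below. The function names are illustrative, and the mirror image for the upper half of the circle is implemented here as $2\pi - \Omega(F - f)$, which is one reading of "mirror image about the real axis".

```python
import numpy as np

def hz_to_mel(f):
    """Mel approximation: m = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_warp_angle(f, fs):
    """Warped angle for 0 <= f <= fs/2, scaled so that fs/2 maps to pi."""
    return np.pi * hz_to_mel(f) / hz_to_mel(fs / 2.0)

def mel_warp_angle_full(f, fs):
    """Extension to 0 <= f <= fs by mirroring about the real axis."""
    f = np.asarray(f, dtype=float)
    upper = f > fs / 2.0
    folded = np.where(upper, fs - f, f)
    omega = mel_warp_angle(folded, fs)
    return np.where(upper, 2.0 * np.pi - omega, omega)

# Usage for a 10 kHz sampling frequency.
fs = 10000.0
freqs = np.linspace(0.0, fs / 2.0, 6)   # 0, 1, ..., 5 kHz
angles = mel_warp_angle(freqs, fs)
```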

The  mel  warping  function  could  be  used 
very  profitably  with  a homomorphic  vocoder  [9] 
which  employs  cepstral  warping  or 
autocorrelation warping [10]. However, using
the  mel  function  with  an  LPCW  vocoder  seems  to 
give  unsatisfactory  results.  We  believe  the 
reason  to  be  as  follows.  For  LP  to  give  best 
results,  it  is  important  that  the  all-pole 
model  is  well  suited  to  the  signal  spectrum, 
which  is  true  for  a large  and  perceptually 
important  class  of  speech  spectra.  If  the 
signal spectrum is warped nonlinearly, then
the  all-pole  model  ceases  to  be  a good 
spectral  model.  Therefore,  the  results  are 
bound  to  be  less  than  satisfactory.  Note  that 
this  problem  does  not  affect  cepstral  warping 
results,  since  cepstral  warping  is  not  based 
on  a specific  model. 

The solution we offer to this problem in
an LPCW vocoder is to have a warping function
that is as linear as possible in the frequency
range where the all-pole model is important,
e.g., up to the third formant region. For
higher frequencies the function can be quite
nonlinear since only a rough estimate of the
spectrum at those frequencies is needed.
Fig. 2 shows a sine warping function

$$\Omega = \pi \sin\left(\frac{\pi f}{F}\right), \qquad 0 \le f \le \frac{F}{2}, \qquad (5)$$

which, for $F = 10$ kHz, is nearly linear up to
2.5 kHz, and very nonlinear above that. One
could, of course, design other warping
functions that comprise more than a single
curve.
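The sine warping function and its inverse can be written down directly. In the sketch below the function names and the "deviation from the tangent line at the origin" measure of nonlinearity are assumptions introduced for illustration.

```python
import numpy as np

def sine_warp_angle(f, fs):
    """Sine warping: Omega = pi * sin(pi * f / fs), 0 <= f <= fs/2."""
    return np.pi * np.sin(np.pi * np.asarray(f, dtype=float) / fs)

def inverse_sine_warp(omega, fs):
    """Frequency f (Hz) whose warped angle is omega, 0 <= omega <= pi."""
    return (fs / np.pi) * np.arcsin(np.asarray(omega, dtype=float) / np.pi)

def deviation_from_linear(f, fs):
    """Relative shortfall of the warp below its tangent line at f = 0."""
    tangent = (np.pi ** 2 / fs) * f
    return 1.0 - sine_warp_angle(f, fs) / tangent

# Usage for a 10 kHz sampling frequency.
fs = 10000.0
dev_mid = deviation_from_linear(2500.0, fs)   # about 0.10
dev_top = deviation_from_linear(5000.0, fs)   # about 0.36
```

By this measure the function departs from a straight line by roughly 10% at 2.5 kHz and by roughly 36% at 5 kHz.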

5.  Examples 

Figs. 3 and 4 show two examples of using
the sine warping function with spectra of the
vowel [o] and the fricative [s], respectively.
In each of the two examples, Fig. a is a
12-pole fit to the original spectrum, Fig. b
is a 9-pole fit to the warped spectrum (shown
after dewarping), and Fig. c is a 9-pole fit
to the original spectrum. Note the greater
detail in the first formant region in Fig. 3b
as compared to Fig. 3a, while the high
frequency region is not matched as well in
Fig. 3b. In comparing the two 9-pole fits,
Figs. 3b and 3c, there is no doubt that
Fig. 3b is a better "perceptual" fit to the
spectrum, since the first three formants in
Fig. 3b are better matched than in Fig. 3c.
In contrast, Fig. 4c seems to be a better fit
to the spectrum than Fig. 4b. But for a
fricative, the match of Fig. 4b might be enough
for good quality resynthesis.
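A comparison of this kind can be set up in outline as follows. The sketch warps a spectrum, fits a 9-pole model, and dewarps the model spectrum as at the receiver. The synthetic four-resonance spectrum stands in for real speech data, and all function names, the pole radius, and the 12-pole synthesis order are assumptions made for the example.

```python
import numpy as np

def sine_warp(w):
    return np.pi * np.sin(w / 2.0)          # sine warp in terms of angle

def warp(half, fn):
    grid = np.linspace(0.0, np.pi, len(half))
    return np.interp(grid, fn(grid), half)  # P'(fn(w)) = P(w)

def dewarp(half, fn):
    grid = np.linspace(0.0, np.pi, len(half))
    return np.interp(fn(grid), grid, half)  # P(w) = P'(fn(w))

def spectral_lp(half, p):
    """Autocorrelation-method LP fit to a half spectrum (0..pi)."""
    r = np.fft.irfft(half)
    idx = np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
    a = np.linalg.solve(r[idx], -r[1:p + 1])
    return a, r[0] + a @ r[1:p + 1]         # coefficients, squared gain

def model_spectrum(a, gain_sq, m):
    """All-pole model spectrum at m points from 0 to pi."""
    inverse_filter = np.concatenate(([1.0], a))
    return gain_sq / np.abs(np.fft.rfft(inverse_filter, 2 * (m - 1))) ** 2

# Synthetic "original" spectrum: four resonances at a 10 kHz sampling rate.
fs, m = 10000.0, 257
poles = [0.95 * np.exp(2j * np.pi * f / fs) for f in (500.0, 1500.0, 2500.0, 3500.0)]
a_true = np.real(np.poly(poles + [np.conj(z) for z in poles]))[1:]
original = model_spectrum(a_true, 1.0, m)

# Transmitter: 9-pole fit to the warped spectrum.
a_w, g_w = spectral_lp(warp(original, sine_warp), 9)
# Receiver: model spectrum, dewarp, then refit for synthesis.
dewarped = dewarp(model_spectrum(a_w, g_w, m), sine_warp)
a_syn, g_syn = spectral_lp(dewarped, 12)
# Reference: 9-pole fit to the original (unwarped) spectrum.
a_ref, g_ref = spectral_lp(original, 9)
```

Plotting `dewarped` against `model_spectrum(a_ref, g_ref, m)` on a dB scale gives the analogue of panels b and c for this synthetic case.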

These  examples  demonstrate  that  the  use 
of  spectral  warping  with  an  LPC  vocoder  could 
lead  to  a more  efficient  representation  of  the 
spectrum  for  the  same  speech  quality. 
Although  it  might  be  practical  to  employ  a 
fixed  warping  function  for  all  situations,  it 
is  certainly  possible  to  use  several  warping 
functions  for  different  types  of  spectra. 
However,  it  is  not  clear  that  the  possible 
increase  in  efficiency  is  worth  the  extra 
cost.


6.  Conclusions 

LPCW, an LPC vocoder with LP spectral
warping, has been proposed. In this vocoder,
a spectral warping function is used to
compress high frequencies relative to low
frequencies, a technique which is hypothesized
to accommodate wider band speech signals. The
result is improved speech quality for the same
transmission rates.

Fig. 1. Subjective pitch versus frequency.

Fig. 2. Mel warping and sine warping.

Fig. 3. LP spectra for the vowel [o].

a) Original, 12-pole

b) Warped, 9-pole (shown dewarped)

c) Original, 9-pole.

Fig. 4. LP spectra for the fricative [s].

a) Original, 12-pole

b) Warped, 9-pole (shown dewarped)

c) Original, 9-pole.

A functional perceptual model is
considered in which continuous speech is
represented in terms of speech parameters
extracted at a minimal set of time points
or frames, not necessarily equally spaced,
in such a way that the perceived quality
of the resynthesized speech is no worse
than that of the full, unreduced,
parameter data from which the model
parameter values are derived. The
validity of this model has been
experimentally demonstrated by the work of
Olive and Spickenagel, using a
phoneme-based, nonautomatic method. In
this paper, we describe the results of our
work towards developing a fully automatic
scheme for perceptual modeling of speech
parametrically represented by LPC
parameters. We present the model and the
automatic scheme from the viewpoint of
their application to efficient, variable
frame rate, narrowband speech
transmission.

1. Perceptual Model

In  our  work  on  developing  minimally 
redundant  narrowband  speech  transmission 
systems, we have used quite successfully
the concept of variable frame rate (VFR)
transmission  [1,2,11,12].  In  a VFR 
scheme,  model  parameters  (LPC  parameters, 
log  pitch,  log  gain)  are  transmitted  only 
when  the  properties  of  the  speech  signal 
have  changed  sufficiently  since  the 
preceding  transmission;  the  parameters  for 
the  untransmitted  frames  are  regenerated 
at  the  receiver  through  linear 
interpolation between the parameters of
the  two  adjacent  transmitted  frames.  For 
example,  speech  parameters  may  be 
transmitted  less  often  during  steady-state 
portions  of  speech,  and  more  often  during 
rapid  speech  transitions. 
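The receiver side of a VFR scheme can be sketched as below: parameters for untransmitted frames are regenerated by linear interpolation between the adjacent transmitted frames. The function name and the toy data are assumptions made for the example.

```python
import numpy as np

def regenerate_frames(tx_index, tx_params, n_frames):
    """Rebuild every analysis frame from the transmitted ones.

    tx_index: increasing frame numbers of the transmitted frames.
    tx_params: array of shape (len(tx_index), n_params).
    Returns an array of shape (n_frames, n_params).
    """
    tx_params = np.asarray(tx_params, dtype=float)
    frames = np.arange(n_frames)
    columns = [np.interp(frames, tx_index, tx_params[:, j])
               for j in range(tx_params.shape[1])]
    return np.column_stack(columns)

# Usage: three transmitted frames out of seven, two parameters each.
tx_index = [0, 4, 6]
tx_params = [[0.0, 10.0], [4.0, 2.0], [0.0, 2.0]]
all_frames = regenerate_frames(tx_index, tx_params, n_frames=7)
```

Frames before the first or after the last transmitted frame simply hold the nearest transmitted values, which is the default behavior of `np.interp`.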

The concept of VFR transmission was
applied to formant parameters of speech by
McLarnon et al. [3], and to LPC parameters
of speech [4] by Magill [5] and by two of
the present authors [1,2]. For VFR
transmission of LPC parameters, the log
likelihood ratio measure of Itakura [6]
has been used for deciding which frames to
transmit. In our work, linear predictive
analysis was done once every 10 ms on
speech, low-pass filtered at 4 kHz and
sampled at 10 kHz, to extract 100
frames/sec (fps) of LPC data. Using the
VFR scheme, we reduced the average frame
rate of transmission of LPC data
(excluding pitch and gain, which were
transmitted at the full fixed rate of 100
fps) to about 37 fps, with only a small
change in the quality of the resynthesized
speech relative to the case when all the
available 100 fps data were transmitted.
Further, we observed that any significant
reduction in the frame rate below 37 fps
introduced, in general, noticeable
distortions in the speech quality.

In  an  effort  to  reduce  the  average 
frame  rate  further,  without  speech  quality 
degradation,  we  have  recently  based  our 
work  on  VFR  transmission  on  the  following 
functional  perceptual  model  of  speech: 

1) Speech can be represented in terms
of  LPC  (or  other)  parameters  extracted  at 
a minimal  set  of  perceptually  significant 
time  points  (or  frames),  not  necessarily 
equally  spaced. 

2)  Between  any  two  such  time  points, 
the  parameters  vary  linearly. 

3)  The  location  of  these  points  is 
obtained  independently  for  pitch,  gain, 
and  spectral  (or  LPC)  parameters. 

Our requirement is that the quality of the
resynthesized speech based on this model
should be no worse than that of the
unreduced or the full 100 fps case. The
question then is: What is the minimal set
of perceptually significant frames for LPC
parameters that is consistent with our
quality requirement? (While we have used
an operational definition of minimal sets
for pitch and gain in our work [11,12], we
shall only discuss the LPC parameters in
this paper.) The recent work of Olive and
Spickenagel [7] suggests that the minimal
set for LPC parameters is on the order of
two frames per phoneme. (This corresponds
to about 24 fps, assuming an average
speech rate of 12 phonemes/sec.) In their
work, they used a manual, trial-and-error
scheme to locate the minimal
representative set, using LPC area
parameters. Pair-wise comparison tests
between the resynthesized speech at the
resulting rate and that at the full 100
fps rate indicated no significant
differences in perceived quality.
Although the method described by Olive and
Spickenagel was not automatic and involved
trial and error adjustments, their work,
nonetheless, provides an "existence proof"
for what we call a perceptual model of
speech, and supplies a reasonable lower
limit to the average frame rate. (Very
recently, Olive [9] reported on a
semiautomatic method that employs an
iterative frame elimination procedure over
individual sentences, and which was found
to yield about the same performance as did
his manual approach [7].)

The  goal  of  the  work  reported  in  this 
paper is to develop a fully automatic VFR
scheme  that  (1)  uses  the  information  about 
the  transmission  parameters  only,  (2) 
results  in  an  average  frame  rate  that  is 
close  to  the  above-mentioned  lower  limit, 
and  (3)  produces  speech  whose  quality  is 
no  worse  than  that  of  the  speech 
synthesized  using  the  full  100  fps  LPC 
data.  Towards  achieving  this  goal,  we 
first  investigated  a manual  or 
nonautomatic  scheme  (Section  2),  and  then 
developed  an  automatic  scheme  based  on  the 
results  and  experience  gained  from  the 
manual  procedure  (Section  3). 

2. Manual Scheme

The  main  purpose  of  the  manual  scheme 
described  below  was  to  gain  insights  and 
ideas  for  developing  transmission  criteria 
for  automatic  perceptual  modeling  based  on 
the  information  about  the  transmission 
parameters  only.  As  transmission 
parameters,  we  used  log  pitch,  log  gain, 
and  log  area  ratios  (LARs).  (For  the  many 
desirable  properties  of  LARs,  see  [8].) 
In  addition,  we  hoped  that  the  results  of 
our  manual  scheme  would  serve  as  another 
experimental  validation  of  the  perceptual 
modeling  hypothesis. 

As  a key  tool  for  manually  carrying 
out  the  perceptual  modeling  task,  we 
developed  an  interactive  display  program 
on  our  PDP-10/IMLAC  PDS-1  computer 
facility.  The  program  displays  all  the 
transmission  parameters  as  well  as  the 
transmission  status  (0  or  1)  of  each  of 
these  parameters  for  every  analysis  frame, 
as  functions  of  frame  number.  For  any 
desired  frame,  the  program  can  also 
display  the  values  of  displayed 
parameters,  the  power  spectrum  of  the 
linear  prediction  filter,  and  the  speech 
waveform  in  that  frame.  By  viewing  the 
displayed  information  for  several 
utterances,  one  gains  an  intuitive  feel 
for the magnitudes of parameter variation
under various speech events and starts to
develop simple rules that may be used in
deciding whether or not a given frame of
data should be transmitted. To further
aid the user, we incorporated a number of
features that allow the user to 1)
manually mark selected frames of analyzed
data for transmission, 2) synthesize
speech from a specified amount of
transmitted data, and 3) play out through
a D/A converter a specified portion of
either synthesized speech or natural
speech or both for on-line evaluation of
relative speech quality.

Using  the  above  interactive  program, 
we  accomplished  the  task  of  manually 
deriving  the  minimal  set  of  frames  for 
several utterances. Pitch, gain and 14
log area ratios, using the autocorrelation
method [4] of linear prediction, were
computed  at  a rate  of  100  fps  from  speech 
sampled  at  10  kHz.  We  selected  a minimum 
number  of  frames  of  LAR  data  for 
transmission,  out  of  the  available  100  fps 
analysis  data,  using  only  the  information 
about  the  transmission  parameters  and 
employing  rules  such  as  the  following: 

1) when log area ratios change roughly
linearly, transmit them only for the
frames corresponding to the endpoints
of the line, since the LARs for the
intermediate frames will be generated
at the receiver through linear
interpolation, and

2) ignore or deemphasize large changes in
the values of LARs when the associated
filter gain is low, since these
low-gain frames have a relatively
small effect on perception.

The overall objective was to reduce the
frame rate as much as possible with the
constraint that the resynthesized speech
should be almost indistinguishable (as
judged by informal listening tests) from
the speech synthesized with all the
analysis frames of data transmitted. We
achieved a minimum frame rate of about 27
fps on the average, computed over 7
sentences of continuous speech from 4
speakers. In terms of phonemes, this rate
came to about 2.2 frames/phoneme, which is
slightly higher than the rate that Olive
and Spickenagel reported [7].

Figure 1 shows the time plots of
pitch in Hz (F0), speech signal energy per
sample in dB (R0) and the first four LARs
in dB (G1-G4), for the utterance "The
trouble with swimming is that you can
drown", spoken by a female. The long
vertical lines mark the frames selected
for transmission using the above manual
approach.


3. An Automatic Scheme

After gaining confidence in our
manual approach, we developed an automatic
scheme for selecting frames of LAR data
for transmission, based on the results and
experience gained from the manual scheme
discussed above. As in the manual scheme,
the automatic scheme uses only the
information about the transmission
parameters. An outline of the automatic
scheme is presented below.

The automatic scheme employs a
two-stage procedure for selecting frames
for transmission. In the first stage, a
chunk of successive analysis frames of
data is considered; the number of frames
in the chunk is variable, but its maximum
can be specified. In the synthesis
experiments discussed later, we chose a
maximum of 9 frames. The decision to
transmit a frame of data is made in the
first stage as follows. Assume that frame
n in the current chunk has been marked for
transmission, and that frame (n+m) is
under consideration. For each of the
(m-1) frames that lie between frames n and
(n+m), and considering the first LAR, G1,
we compute the error between the actual
value of G1 and the value obtained from
linear interpolation between frames n and
(n+m). These (m-1) errors are squared,
weighted first by the speech signal energy
(in dB) of the corresponding frame and
then by a quantity which depends inversely
upon the local rate of change of G1, and
then finally averaged. This weighted
average error is compared against a
threshold. If the threshold is exceeded,
frame (n+m-1) is marked for transmission;
if not, the above procedure is repeated
for G2, etc. The scheme considers up to
G4 only; if the error does not exceed the
threshold for any of the four LARs, it
advances to frame (n+m+1) and the entire
procedure is repeated. Of course, if a
frame is marked for transmission, all the
LARs are simultaneously transmitted.
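
The first-stage decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the report does not give the weighting functions or thresholds, so the inverse rate-of-change weight `1 / (1 + rate)`, the central-difference rate estimate, and the handling of the chunk limit are all hypothetical choices. Energies are assumed to be non-negative dB values.

```python
def weighted_error(track, energy_db, n, m):
    """Weighted mean squared interpolation error for one LAR track
    over the (m-1) frames strictly between frames n and n+m."""
    total = 0.0
    for k in range(1, m):
        interpolated = track[n] + (track[n + m] - track[n]) * k / m
        error = track[n + k] - interpolated
        # Hypothetical local rate of change: central difference.
        rate = abs(track[n + k + 1] - track[n + k - 1]) / 2.0
        # Hypothetical weight: energy in dB, reduced where the LAR moves fast.
        weight = energy_db[n + k] / (1.0 + rate)
        total += weight * error * error
    return total / (m - 1)


def select_frames(lar_tracks, energy_db, threshold, max_chunk=9):
    """Return the frame indices marked for transmission (first stage only).

    lar_tracks[i][n] is the value of LAR G(i+1) at frame n.
    """
    last = len(energy_db) - 1
    marks = [0]
    n = 0
    while n < last:
        # A chunk holds at most max_chunk frames, counting frame n itself.
        limit = min(n + max_chunk - 1, last)
        chosen = limit
        for candidate in range(n + 2, limit + 1):
            m = candidate - n
            # Test G1 first, then G2, ... up to G4, stopping at the first
            # LAR whose weighted error exceeds the threshold.
            if any(weighted_error(t, energy_db, n, m) > threshold
                   for t in lar_tracks[:4]):
                chosen = candidate - 1
                break
        marks.append(chosen)
        n = chosen
    return marks
```

On a track that rises linearly and then falls linearly, the sketch marks the turning point and skips the frames on either side of it, which is the behavior rule 1 of the manual scheme aims for.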

The second stage of the automatic
scheme considers the last transmitted
frame in the previous chunk and those
frames in the present chunk that have been
marked for transmission, and attempts to
eliminate any unnecessary transmissions.
The decision procedure employed in the
second stage is the same as in the first
stage, except that now the time-averaged
error is also averaged over the first four
LARs. Our experiments indicated that the
second stage deleted about 10% of the
transmission marks decided by the first
stage.

It should be pointed out that the
choice of the various values of the
weighting functions and the thresholds
involved extensive experimentation; we
optimized the choice by comparing against
the transmission marks that were manually
obtained, and by listening to the
resulting synthesized speech.

We used the automatic scheme over the
same speech utterances that we
experimented with in our manual perceptual
modeling approach. The average frame rate
of LAR transmission with the automatic
scheme came out to be about 26 fps.
Although the average frame rates were
approximately the same for the manual and
automatic schemes, the actual locations of
transmitted frames were, in general,
different for the two cases. Informal
listening tests conducted on the syntheses
obtained from the manual and the automatic
perceptual modeling approaches and from
the fixed 100 fps system indicated that
they all have roughly the same overall
quality. An experienced listener could,
for some utterances, pick the synthesis
from the automatic scheme as being
slightly inferior to the syntheses from
the other two systems. We plan to modify
some of the details of the automatic
scheme to enhance the quality of the
resynthesized speech.

4.  Discussion

In all the synthesis experiments
reported above, pitch and gain were
transmitted at the full 100 fps rate and
none of the transmission parameters were
quantized. To investigate the perceptual
model under parameter quantization and
under conditions of narrowband speech
transmission, speech was preemphasized and
analyzed using an 11-th order LPC model;
pitch and gain were quantized
logarithmically using 6 bits and 5 bits,
respectively; LARs were quantized using
about 44 bits/frame [2]. Comparisons of
syntheses from the fixed 100 fps system
and the VFR system based on the perceptual
model (manual or automatic scheme)
indicated the following interesting, and
perhaps surprising, result: For
utterances for which LPC parameters vary
relatively slowly in time (e.g., "Why were
you away a year, Roy?"), the syntheses
from the 100 fps system sounded worse, in
particular had a more "wobbly" quality,
than those from the VFR system. We had
had the same experience with our earlier
VFR scheme that uses the log likelihood
ratio criterion. Our explanation for the
observed quality difference is that for
slowly varying utterances, the error due
to parameter quantization is more than the
error due to interpolation. It is due to
the above result that we required of the
perceptually modeled speech to have a
perceived quality that is no worse than
that of the unreduced system from which it
is derived, since we have seen that a
lower rate VFR system can sometimes sound
better than the unreduced system.

The perceptual model described in
this paper deals with the problem of
parametric representation of speech in
time, which is perhaps the most important
aspect of the overall problem of
redundancy removal in speech as suggested
by the results of a recent quality test
[10]. We have investigated in detail
several other aspects, which include
variable order linear prediction [2,**],
optimal parameter quantization and bit
allocation [8], and Huffman coding of
quantized parameters [2]. More recently,
we have proposed VFR schemes for pitch and
gain also [11]. When we employed all
these compression techniques (without
Huffman coding) with the automatic VFR
scheme given in Section 3, we obtained
good quality speech at average bit rates
as low as 1700 bps, measured for
continuous speech. With Huffman coding,
we expect the average rate to drop below
1400 bps with absolutely no further
reduction in quality.
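
As a rough worked example of what the quantization settings above imply, the sketch below computes the bit rate when pitch and gain are sent at the full 100 fps and only the LAR frame rate is reduced. It uses only the figures quoted in this section (6, 5 and about 44 bits; 100 and 26 fps). It ignores any side information needed to signal which frames carry LARs, and it does not model the additional techniques (variable order, optimal bit allocation, VFR pitch and gain) that bring the reported average down to 1700 bps; both omissions are simplifying assumptions.

```python
PITCH_BITS = 6        # logarithmic pitch quantization
GAIN_BITS = 5         # logarithmic gain quantization
LAR_BITS = 44         # "about 44 bits/frame" for the LARs
FULL_RATE_FPS = 100   # analysis frame rate


def bit_rate(lar_fps, pitch_gain_fps=FULL_RATE_FPS):
    """Bits per second for the given LAR and pitch/gain frame rates."""
    return (PITCH_BITS + GAIN_BITS) * pitch_gain_fps + LAR_BITS * lar_fps


fixed_rate = bit_rate(100)  # every frame carries LARs
vfr_rate = bit_rate(26)     # LARs sent at the automatic scheme's average rate
```

Under these assumptions the fixed-rate system needs 5500 bps and the variable-frame-rate system about 2244 bps, so most of the saving comes from the LARs alone.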

Although we discussed above the VFR
scheme as applied to efficient speech
transmission, the scheme may be used in
other applications such as speech storage
and retrieval, speech synthesis by rule
(as in the work of Olive and Spickenagel
[7]), or for segmentation purposes in
speech recognition.

Figure 1.  Time plots of transmission
parameters for a sentence,
along with transmission marks
(long vertical lines) obtained
by the manual scheme.

ABSTRACT

This paper presents an excitation source model
for speech compression and synthesis, which allows
for a degree of voicing by mixing voiced (pulse)
and unvoiced (noise) excitations in a
frequency-selective manner. The mix is achieved by
dividing the speech spectrum into two regions, with
the pulse source exciting the low-frequency region
and the noise source exciting the high-frequency
region. A parameter Fc determines the degree of
voicing by specifying the cut-off frequency between
the voiced and unvoiced regions. For speech
compression applications, Fc can be extracted
automatically from the speech spectrum and
transmitted. Experiments using the new model
indicate its power in synthesizing natural sounding
voiced fricatives, and in largely eliminating the
"buzzy" quality of vocoded speech. A functional
definition of buzziness and naturalness is given in
terms of the model.


1.  INTRODUCTION 

Perhaps the single most important decision to
be made in a pitch-excited speech compression
system (vocoder) is the voiced/unvoiced (V/U)
decision. Errors in this decision are readily
perceived by the ear as a degradation of speech
quality, and may also be accompanied by a loss in
intelligibility. Yet, even if the V/U decision
were somehow to be made "perfectly", the synthetic
speech would continue to exhibit a distinct lack of
naturalness, exemplified by a certain "buzziness"
and a "lack of fullness." These characteristics
are symptoms of the inadequacy of the binary V/U
excitation model.

This paper explores the excitation problem in
speech synthesis and presents a simple mixed-source
model that allows for a degree of voicing. The new
model is capable of producing more natural sounding
speech: it seems to largely eliminate the buzziness
problem and recover much of the fullness in the
speech. In addition, it promises to reduce the
adverse effects of voicing errors. A review of
previous research relating to this model is given
in a later section.

2.  BASIC SYNTHESIS MODEL AND TERMINOLOGY

Throughout this paper, we shall assume the
basic synthesis model shown in Fig. 1. In this
model, a time-varying excitation signal excites a
time-varying spectral shaping filter, the output of
which is the synthetic speech. The excitation
signal is assumed to have a flat spectrum, so that
the spectral envelope of the synthetic speech is
determined completely by the spectral shaping
filter. Furthermore, we shall assume this model to
hold for any type of synthesis, whether as part of
a vocoder system or a synthesis system. In fact,
we wish to argue below that our proposed source
model is indeed adequate for both applications.

Restricting the excitation to have a flat
spectrum necessarily limits us to two types of
excitation: deterministic (pulse) or random
(noise).

a)  Pulse Source (Buzz)

The deterministic excitation is, in general,
the impulse response of an all-pass filter, which
we shall call an all-pass signal or pulse. The
most trivial form of an all-pass pulse is a single
impulse. When the pulse source produces a sequence
of pulses separated by a pitch period, it is known
as a buzz source. (Note that a single pulse could
be used in the synthesis of the burst in a plosive
sound [1]. However, the burst can also be
synthesized using the noise source. We shall
assume the latter in this paper; the pulse source
will be used exclusively for buzz excitation.)

b)  Noise Source (Hiss)

The random noise excitation may be the output
of a random number generator. Generators with
either a uniform or Gaussian probability
distribution are readily available and are quite
adequate. The noise source is also known as a hiss
source.

Whether the actual excitation is buzz or hiss,
or a combination of the two, one must always make
sure that the excitation has a flat spectrum. We
shall now describe how one might derive an
appropriate source model by inspecting short-time
speech spectra.

3.  THE "IDEAL" SOURCE

For some particular speech signal, one can
remove the short-time spectral envelope by
appropriate inverse filtering, as shown in Fig. 2.
The inverse filter A(z) can be obtained by cepstral
techniques [2] or through the use of linear
prediction [3]. The residual signal e(t) will then
have a nominally flat spectrum. If in Fig. 1, the
excitation u(t) is identical to the residual e(t),
and the synthesis filter H(z) is the inverse of
A(z), then the synthetic speech s'(t) will be
identical to the original signal s(t).
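
The identity between s'(t) and s(t) can be checked numerically. The sketch below is an illustration only: it obtains A(z) with the autocorrelation method of linear prediction (Levinson-Durbin recursion), which is one standard choice and not necessarily the analysis used for the figures in this paper. The test signal, the model order of 8, and all function names are assumptions.

```python
import math


def autocorrelation(signal, order):
    """Short-time autocorrelation r[0..order] of the signal."""
    return [sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Coefficients a[0..order] of A(z) = 1 + a[1] z^-1 + ... (a[0] = 1)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                     # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def inverse_filter(signal, a):
    """Residual e[n] = sum_k a[k] s[n-k], i.e. the signal filtered by A(z)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


def synthesis_filter(excitation, a):
    """All-pole filter H(z) = 1/A(z) applied to the excitation."""
    out = []
    for n, u in enumerate(excitation):
        out.append(u - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out


# Hypothetical test signal: two sinusoids plus a small chirp-like term.
speech = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n + 0.4)
          + 0.1 * math.sin(2.9 * n * n) for n in range(200)]
a = levinson_durbin(autocorrelation(speech, 8), 8)
residual = inverse_filter(speech, a)
rebuilt = synthesis_filter(residual, a)   # equals speech up to rounding
```

Exciting the synthesis filter with the exact residual reproduces the input, which is why the interesting question is how much the excitation can be simplified before the result stops sounding like the original.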

However, for synthesis purposes, the synthetic
signal need only sound like the original, and need
not be identical to it. In addition, we need to
manipulate the source pitch and to minimize the
number of bits needed to represent the source. In
order to accomplish this task, we first make use of
an important property of speech perception, namely
that it is relatively insensitive to the short-time
phase. Therefore, in order to model the residual
e(t) to meet our requirements, we need only look at
its spectrum and, except for pitch, disregard its
phase structure for the moment.

Fig. 3 shows the signal power spectrum of 25.6
ms of a 10 kHz sampled signal in the middle of the
vowel [I] in the word "list", and the corresponding
residual spectrum. The residual was obtained by
inverse filtering the speech signal with a 20th
order linear prediction inverse filter. If somehow
one could generate an excitation u(t) whose
spectrum is identical to the residual spectrum, the
synthetic speech would then sound (almost) the same
as the original.

Therefore, our aim in developing source models
will be to obtain an excitation spectrum that is as
close as possible to the residual spectrum.
Furthermore, we wish to obtain such an excitation
spectrum using only the buzz and hiss sources
described in Section 2. The source models will
stem naturally from examining the characteristics
of residual spectra.

4.  CHARACTERISTICS OF RESIDUAL SPECTRA

In Fig. 3, the residual spectrum shows a clear
periodicity up to about 3.5 kHz, and a lack of
periodicity above that frequency. The periodicity
corresponds to harmonics of the pitch fundamental
frequency. By looking at residual spectra of other
sounds it becomes amply clear that the existence of
aperiodic frequency bands in sonorant sounds is
quite common. While in Fig. 3 one can identify
only two bands, it is possible to have several
periodic and aperiodic adjacent bands in 5 kHz.
For more examples, the reader is referred to the
work of Fujimura [4], who studied voice
aperiodicity by examining short-time signal
spectra.

Partial devoicing of certain sounds is
well-known from physical considerations. For
example, the devoicing of [z] above about 1 kHz is
well recognized and has long been taken advantage
of in the synthesis of more natural voiced
fricatives. On the other hand, it is also known
that in the production of the tense front vowel
[i], the constriction may become narrow enough to
generate some turbulence, which is seen as
devoicing of frequencies above about 3 kHz.
However, most synthesizers to date have not taken
advantage of this fact.

In addition to the foregoing types of sources
of devoicing, Fujimura [4] has hypothesized that
some of the spectral devoicing may be due to
aperiodicities or irregularities in the vocal-cord
movement. We have noticed that spectral devoicing
often occurs during transitions between different
sounds, including sonorant-sonorant transitions.
In contrast to the examples given in the previous
paragraph, we believe that the spectral devoicing
due to vocal-cord irregularities and/or spectral
transitions may in fact be an artifact of the
spectral estimation process. Whether such devoiced
regions should be synthesized using a noise source
is questionable.

In conclusion, residual spectra may be
completely periodic (voiced), completely aperiodic
(unvoiced), or may contain regions that are
periodic and others that are aperiodic. The
question now is how to model such spectra using the
buzz and hiss sources.

5.  PROPOSED SOURCE MODEL

One reasonable source model would divide the
spectrum into a number of bands. Each band would
then be excited by the buzz source if the band is
considered periodic, and by the hiss source if the
band is considered aperiodic. Fujimura [4] used a
3-band model in his experiment, and reported an
improvement in speech naturalness. However, given
our observations that spectral aperiodicities may
not necessarily result from turbulent excitations,
we have chosen a different model. In our model, we
shall consider all spectral aperiodic regions that
are in between two periodic regions to be in fact
periodic. In other words, only the band above the
periodic region with the highest frequency will be
considered to be aperiodic and generated by a
turbulent source. Our reasons for this choice are
twofold: (a) Turbulent sources are more likely to
excite higher frequencies; and (b) Excessive
devoicing can be as degrading to quality as
excessive voicing.

The resulting model is shown in Fig. 4. It is
a mixed-source model with the buzz source exciting
a time-varying low-frequency region of the
spectrum, and the hiss source exciting the
remaining high-frequency region. The selective
excitations are realized by passing the pulse
excitation through a low-pass filter with cutoff
Fc, and the noise excitation through a high-pass
filter with the same cutoff frequency Fc. The
outputs of the two filters are then added,
multiplied by the source gain and applied to the
spectral shaping filter as the excitation signal.
The model, then, has only two parameters: the
cutoff frequency Fc, and the pitch period T when
Fc > 0. Since small changes in Fc are not
perceptible, it is sufficient to quantize Fc into
2-3 bits for transmission purposes.
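
The signal flow just described can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the second-order Butterworth biquads (standard bilinear-transform formulas with Q = 1/sqrt(2)), the 10 kHz sampling rate, the single-impulse pulse shape, the unit-power scaling of the two sources, and all names are assumptions.

```python
import math
import random


def butterworth2(fc, fs, kind):
    """Second-order Butterworth biquad with its 3 dB point at fc.

    Returns (b, a) with a[0] normalized to 1; kind is "low" or "high".
    """
    w0 = 2.0 * math.pi * fc / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # Q = 1/sqrt(2)
    if kind == "low":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    else:
        b = [(1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0]
    a0 = 1.0 + alpha
    return [x / a0 for x in b], [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]


def biquad(x, b, a):
    """Direct-form I second-order filter."""
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1, y2, y1 = x1, xn, y1, yn
        y.append(yn)
    return y


def mixed_excitation(num_samples, pitch_period, fc, fs=10000.0,
                     gain=1.0, seed=0):
    """Buzz below fc plus hiss above fc, scaled by the source gain."""
    rng = random.Random(seed)
    # Impulse train scaled to unit average power, to match the noise.
    amp = math.sqrt(pitch_period)
    buzz = [amp if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    hiss = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    if fc <= 0.0:                 # fully unvoiced: hiss only
        return [gain * h for h in hiss]
    if fc >= fs / 2.0:            # fully voiced: buzz only
        return [gain * p for p in buzz]
    low = biquad(buzz, *butterworth2(fc, fs, "low"))
    high = biquad(hiss, *butterworth2(fc, fs, "high"))
    return [gain * (l + h) for l, h in zip(low, high)]
```

In this sketch the binary V/U model is recovered at the two extremes: Fc = 0 gives pure hiss and Fc at half the sampling rate gives pure buzz, with every intermediate value giving a partially voiced excitation.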

6.  IMPLEMENTATION 

a)  Extraction of Source Parameters

The only difference between parameter
extraction for the new source model and traditional
pitch extraction is that the V/U binary decision
has been replaced by the determination of a
multi-valued parameter Fc, in our model. The
extraction of the pitch period is unchanged. Pitch
period determination is relatively straightforward;
many schemes exist that are quite adequate.

Just as V/U decision algorithms have
proliferated, many algorithms will be developed
that attempt to compute Fc in a perceptually
satisfactory manner. The method we have chosen
thus far is a peak-picking algorithm on the signal
spectrum. The algorithm determines periodic
regions of the spectrum by examining the separation
between consecutive peaks and determining whether
the separations are the same, within some tolerance
level. Fc is taken to be the highest frequency at
which the spectrum is considered to be periodic.
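
One way such a peak-picking rule might look is sketched below. The paper gives no details of its algorithm, so everything specific here is an assumption: the local-maximum peak definition, the 15% spacing tolerance, the requirement of three consecutive consistent spacings before a region counts as periodic, and the function names.

```python
def find_peaks(spectrum):
    """Indices of local maxima of a sampled magnitude spectrum."""
    return [i for i in range(1, len(spectrum) - 1)
            if spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]]


def estimate_fc(spectrum, bin_hz, tolerance=0.15, min_run=3):
    """Highest frequency (Hz) at which the peak spacing is still regular.

    Returns 0.0 when no periodic region is found (fully unvoiced).
    """
    peaks = find_peaks(spectrum)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    fc_bin = 0
    run = 0  # consecutive pairs of gaps that agree within the tolerance
    for i in range(1, len(gaps)):
        if abs(gaps[i] - gaps[i - 1]) <= tolerance * gaps[i - 1]:
            run += 1
            if run >= min_run:
                fc_bin = peaks[i + 1]  # upper end of the regular stretch
        else:
            run = 0
    return fc_bin * bin_hz
```

Because the sketch keeps the highest regular stretch it finds, an aperiodic band lying between two periodic bands is treated as periodic, which matches the choice made in Section 5.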

b)  Filter Implementations

In our initial implementation we rounded the
value of Fc to the nearest 500 Hz. Therefore, we
needed lowpass and highpass filters with cutoff
frequencies separated by 500 Hz. The filter
designs were then stored and used in the synthesis
as the need arose.

For each value of Fc, the 3 dB points for the
lowpass and highpass filters were designed to be
equal to Fc, in order that the spectrum of the
final excitation may be as flat as possible. The
roll-off of the filters was considered to be of
secondary importance, but should not be very sharp
in any case. We considered FIR (finite impulse
response) as well as recursive (low order
Butterworth) filters. Both types of filters gave
similar perceptual results.
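
The reason for placing both 3 dB points at Fc can be checked with the textbook magnitude response of a Butterworth filter. The sketch below assumes the buzz and hiss sources are uncorrelated and of equal power, so that their powers add after filtering; the filter order and the helper names are illustrative, and the analog prototype formula stands in for whatever stored designs were actually used.

```python
import math


def round_cutoff(fc_hz, step_hz=500.0):
    """Round Fc to the nearest multiple of step_hz (500 Hz in the text)."""
    return step_hz * math.floor(fc_hz / step_hz + 0.5)


def butterworth_power(f_hz, fc_hz, order, kind):
    """Squared magnitude of an order-N Butterworth lowpass or highpass."""
    ratio = (f_hz / fc_hz) ** (2 * order)
    if kind == "low":
        return 1.0 / (1.0 + ratio)
    return ratio / (1.0 + ratio)


def excitation_power(f_hz, fc_hz, order=2):
    """Total power gain seen by equal-power, uncorrelated buzz and hiss."""
    return (butterworth_power(f_hz, fc_hz, order, "low")
            + butterworth_power(f_hz, fc_hz, order, "high"))
```

At Fc each filter passes half the power (about -3 dB), and at every frequency the two power gains sum to one, so the combined excitation spectrum stays flat regardless of the filter order.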

7.  RESULTS 

Using the implementation described in Section
6, we compared the resulting syntheses to those
using the binary V/U model in the context of a
linear prediction (LPC) vocoder. A number of
sentences from male and female speakers [5] were
used in comparing the two analysis-synthesis
systems. No quantization of parameters (except for
Fc) was performed. One of the sentences had a
concentration of fricative sounds, "His vicious
father has seizures," and another was a nonnasal
sonorant sentence, "Why were you away a year, Roy?"
Other sentences were more general. With the V/U
source, the fricative sentence sounded particularly
buzzy for both male and female speakers, while the
sonorant sentence was judged as buzzy only for
low-pitched male speakers. The buzziness in both
sentences was greatly reduced when using the
mixed-source model. In general, the buzziness was
always reduced with the new model. However, for
some sentences the new synthesis produced certain
small background noises. Upon careful listening,
it was determined that some of those noises were
present in the V/U synthesis but were masked by the
buzziness. The other noises may be due to
inaccurate determination of Fc and/or to the
particular implementation of the model.

Overall, listeners thought that the new model
performed better for female speakers (a pleasant
surprise, for a change). The new synthesis was
"raspier" and more in line with female speech which
is considered to be more breathy than male speech.

A number of listeners reported that the new
synthesis had a certain "fullness" that was absent
with the V/U synthesis. We interpret this as an
indication of the greater naturalness resulting
from the new model.

8.  REVIEW OF RELATED WORK

The only other work we know of where mixed
excitation was used with LPC vocoders was that of
Itakura and Saito [6]. But there, the two sources
excited the whole spectrum simultaneously, with the
"degree" of voicing being controlled by the
relative amplitudes of the sources. The results
were not encouraging [7].

After the development of our model over two
years ago, we became aware of Fujimura's work
[8,4], who as far as we know, was the first to
suggest and test a frequency-selective mixed-source
model. His work, which we mentioned earlier, was
performed in the context of a pitch-excited channel
vocoder. During the writing of this paper,
Fujimura brought to our attention his other work
with Kato et al. [9], where a variable cut-off
frequency like ours was employed, but using a
different algorithm to determine the cut-off. The
work was done with a hybrid voice-excited and
pitch-excited channel vocoder, and they reported
excellent results. Coulter [10] used mixed
excitation for the synthesis of voiced fricatives;
however, the cut-off between the low and high
frequency bands was fixed.


In speech synthesis, mixed excitation has been
used routinely for the synthesis of voiced
obstruents (see, for example, [1,11]). The
parallel formant synthesizer of Holmes [1] allows
for variable mixed excitation, and was especially
used in transitions between unvoiced and voiced
sounds. Upon careful reading, it became clear to
us that the spirit of Holmes' synthesizer is
similar to ours, except that the controls in his
case are more complicated. A more recent hardware
synthesizer by Strube [12] allows for mixed
excitation using a single variable RC-circuit.

There have been numerous attempts at reducing
buzziness by changing the shape of the pulse in
voiced excitation, but to no avail. Recently,
Sambur et al. [13] reported a reduction in
buzziness by changing the pulse width to be
proportional to the pitch period. Unfortunately,
changing the pulse width changes the excitation
spectrum; the effect is that of a variable lowpass
filter. Spectrally flattening the pulse before
excitation cancelled the reduction in buzziness.


9.  DISCUSSION 

a)  Buzziness and Naturalness

It is interesting that the mixed-source model
appears to reduce two seemingly different types of
buzziness: the buzziness in voiced fricative
synthesis, and the buzziness in sonorant synthesis
associated mainly with low-pitched voices. Our
hypothesis is that the two types of buzziness, in
fact, result from the same process: that of an
excess in buzz source excitation. Thus, our
general rule is that:

too much buzz --> "buzziness"

too much hiss --> "breathiness" or "raspiness"

where the arrow is to be read as "results in". If
more of the spectrum is excited by the buzz source
than is necessary for naturalness, the result is
buzziness. Similarly, if there is more hiss
excitation than is necessary for naturalness, the
result is breathiness or raspiness. This leads us
to a functional definition of naturalness, as it
relates to mixed excitation:

Naturalness is achieved by that proper mix of
buzz and hiss excitations that leads to a
synthesis that is neither buzzy nor breathy
or raspy.

b)  Modulation and Naturalness

Certain synthesizers, such as that of Klatt
[11], modulate the hiss source by the buzz source
for the synthesis of voiced fricatives. While it
is known that the noise source in the vocal tract
is in fact modulated by the vocal cord output, it
is not clear that such modulation is necessary for
achieving naturalness in synthetic speech.
Whatever effect modulation has, it appears to be of
a secondary nature. The synthesizer of Holmes [1]
does not contain any modulation, and he reported
very natural speech synthesis. Although initially
we included modulation in our model, it is our
opinion at this point that source modulation is not
necessary for natural synthesis, and therefore we
have decided not to incorporate it as part of the
model.


c)  Phase and Naturalness

It is generally agreed that proper phase
determination of buzz excitation should lead to
more natural synthesis. Furthermore, such phase
cannot be in the form of some "optimal" pitch pulse
shape. The phase must change from one pitch pulse
to the next in some appropriate manner. Thus far,
our model calls for an all-pass pulse, but does not
specify the phase. Exactly how the phase should
change between pulses is a subject for future
research.


10.  CONCLUSION

We have presented a frequency-selective
mixed-source excitation model for use in both
speech compression and speech synthesis. The model
has a single continuous parameter, Fc, which
divides the spectrum into two regions, with the
buzz source exciting the low frequency region below
Fc, and the hiss source exciting the high frequency
region above Fc. Naturalness (no buzziness or
breathiness) is achieved by the proper mix of the
two sources, i.e., by the proper determination of

Ratio' Vocoder in Analog Hardware," IEEE Trans.
Acoustics, Speech and Signal Processing, pp.
387-391, Oct. 1977.

13.  M. Sambur, A. Rosenberg, L. Rabiner and
C. McGonegal, "On Reducing the Buzz in LPC
Synthesis," 1977 IEEE Int. Conf. Acoustics,
Speech and Signal Processing, Hartford, Conn.,
pp. 401-404, 1977.

14.  B. Atal, personal communication.


Phoneme-Specific Sentences


All voiced, sonorant

Type 11:  glides = w, r, y + vowels (no zeroes)
Type 12:  glides = w, r, y, + l

All voiced, sonorant, with zeroes

Type 21:  nasal = m, n, ng
Type 22:  nasals + l = m, n, ng, l
Type 23:  glides + nasals = w, r, y, l, m, n, ng

Stops and Affricates

Type 41:  voiced stops = b, d, g
Type 42:  voiced stops + affricate = b, d, g, j
Type 43:  unvoiced stops = p, t, k
Type 44:  unvoiced stops + affricate = p, t, k, ch
Type 45:  stops + affricates = b, d, g, j, p, t, k, ch

Fricatives

Type 51:  voiced fricatives = v, dh, z, zh
Type 52:  unvoiced fricatives = f, th, s, sh
Type 53:  fricatives = f, th, s, sh, v, dh, z, zh

Place

Type 61:  all labials = p, b, f, v, w, m
Type 62:  all tongue-tip = t, d, th, s, sh, dh, z, zh, n, ch, j, l, r, y

Voicing

Type 71:  voiced stops, affric, frics = b, d, g, j, v, dh, z, zh
Type 72:  unvoiced stops, affric, frics = p, t, k, ch, f, th, s, sh
Type 75:  all voiced = b, d, g, j, v, dh, z, zh, m, n, ng, l, r, w, y
Type 76:  all unvoiced = p, t, k, ch, f, th, s, sh, h

Type 80:  Miller & Nicely Demos = all consonants except ch, j, h, y
Type 81:  All consonants


Sentences


Type  11:  glides  = w,  r,  y + vowels

* 1.  Why  were  you  away  a year,  Roy? 

2.  Why  were  you  weary? 


Type  12:  glides  = w,  r,  y,  + l

1.  Why  were  you  all  weary? 

2.  Our  lawyer  will  allow  your  rule. 

3.  Our  rule  will  allow  you  a lawyer. 

4.  We  really  will  allow  you  a ruler. 


Type  21:  nasal  = m,  n,  ng

* 1.  Nanny  may  know  my  meaning. 

2.  Many  young  men  owe  money. 

3.  I'm  one  man  among  many. 

4.  When  may  we  know  your  name? 

5.  I'm  naming  my  own  mine. 

6.  No-one  knowing  my  name. 

7.  I'm  no  mean  man.

8.  I know  many  a mean  man. 

9.  I know  many  mean  men. 

10.  One  name  among  many. 

11.  I'm  known  among  men. 

12.  A man  on  a moon. 

13.  I know  no  minimum. 

14.  I'm  naming  one  man  among  many. 

15.  I'm  owing  no-one  any  money. 

16.  Anne  'n  May  own  many. 

17.  Anne  'n  Arnie  own  one.  (n  only) 


Type  22:  nasals  + l  = m,  n,  ng,  l

1.  I'm  well  known  among  men. 

2.  Nine  men  moaning  all  morning. 


Type  23:  glides  + nasals  = w,  r,  y,  l,  m,  n,  ng


1.  Where  were  we  all  wrong? 

2.  You  were  wrong  all  along. 

3.  I'll  warn  Ron  away. 

4.  I know  you're  a loner. 

5.  I know  you're  all  alone. 

6.  I really  mean  weighing  in.

7.  Why  are  you  nailing  Wally? 

8.  When  will  our  yellow  lion  roar? 

9.  Will  you  ring  an  alarm? 

10.  A morning  alarm  rang. 

11.  An  alarm  rang  in  only  one  room. 

12.  An  alarming  rule. 

13.  We  may  allow  a new  ruling. 

14.  A lawyer  may  well  allow  a new  ruling. 

15.  An  alarm  rang  a warning. 

16.  You  will  alarm  me  no  more!

17.  I'm  learning  a (my)  new  role. 

18.  I know  you're  really  alone. 

19.  I'll  remain  in  my  narrow  room. 

20.  We'll  rely  on  no-one. 

21.  Anyone  may  rely  on  a mail-man. 

22.  I'll  wear  a maroon  ring. 

23.  I'm  wearing  my  maroon  ring. 

24.  We'll  remain  all  morning. 

25.  You'll  remain  in  your  room  all  morning. 

26.  We're  all  in  mourning. 

27.  We'll  allow  you  a new  loan. 

28.  You're  learning  a new  rule. 

29.  I'll  lie  in  an  alarming  manner. 

30.  Why  lie,  when  you  know  I'm  your  lawyer? 

31.  Any  animal  may  run  away. 

32. A normal animal will run away.

33.  Mail  me  an  aluminum  railing. 

34.  Marilyn  alone  will  marry  me. 

35.  I'll  willingly  marry  Marilyn. 

36. I'm more normal in early morning.

Type 41: voiced stops = b, d, g

1.  Do  you  abide  by  your  bid? 

2.  Grab  a doggie  bag. 

3.  A greedy  boy  died. 

4.  Dad  would  buy  a big  dog. 

5.  Bobby  did  a good  deed. 

6.  I begged  Dad  buy  a dog. 

7.  Why  did  Gay  buy  a bad  egg? 

8.  Did  Bobby  do  a good  deed? 

9.  Buy  Dad  a bad  egg. 

Type 42: voiced stops + affricate = b, d, g, j

1.  Did  George  do  a good  job? 

2.  Greg  adjudged  Bobby  dead. 

Type 43: unvoiced stops = p, t, k

1.  Kate  typed  a paper. 

2.  Take  a copy  to  Pete. 

3.  Pat  talked  to  Kitty. 

4.  Quite  a cute  act. 

5.  Peter  took  out  a potato. 

6.  Patty  cut  up  a potato  cake. 

Type 44: unvoiced stops + affricate = p, t, k, ch

1.  Chip  took  a picture. 

2.  Teacher  patched  it  up. 

3.  Chat  quietly  to  teacher. 

4.  Quite  quiet  at  church. 

5.  Catch  a paper  cup. 

6.  Actuate  a paper  copier. 

7.  Teacher  taped  up  a packet. 

8.  Keep  quite  a cute  picture. 

9.  Keep  quiet  at  church. 

10.  Capture  a cute  puppy. 

11.  Teacher  typed  up  a paper. 

12.  Katie  tacked  up  a cute  picture. 

Type  45:  stops  + affricates  = b,  d,  g,  j,  p,  t,  k,  ch. 

* 1.  Which  tea-party  did  Baker  go  to? 

2. We'd better buy a bigger dog.

3. Georgie had to chew tobacco.

Type 51: voiced fricatives = v, dh, z, zh

1.  View  these  azure  vases. 

2.  They  use  our  azure  vials. 

3.  There's  our  azure  vial. 

4.  There's  usually  a valve. 

5.  Those  waves  veer  over. 


Type 52: unvoiced fricatives = f, th, s, sh

1.  I saw  three  fish. 

2.  A thief  saw  a fish. 

3.  Three  chefs  face  a thief. 


Type 53: fricatives = f, th, s, sh, v, dh, z, zh

* 1.  His  vicious  father  has  seizures. 

2.  Whose  shaver  has  three  fuses? 

3.  Three  of  the  chefs  saw  the  thieves. 

Type 61: all labials = p, b, f, v, w, m

(none)

Type 62: all tongue-tip = t, d, th, s, sh, dh, z, zh, n,

1.  The  judge's  harsh  decision  really  touched  the  youth. 

2.  Each  decision  shows  the  jury  he  lies  through  his  yellow 
teeth. 

3.  Such  a rash  allusion  to  dosage  teases  the  youth. 

4. Seth yawns at each rash allusion to the dosage.

5.  The  designers  really  earned  the  judge's  derision  this  year. 

6.  Each  allusion  to  Daisy's  agility  lessens  her  attention. 

7.  Each  decision  shows  that  he  lies  through  his  yellow  stained 
teeth. 

8.  John  drowned  his  sorrows  in  gin  and  orange  juice. 

Type 71: voiced stops, affric, frics = b, d, g, j, v, dh, z, zh

(none)

Type 72: unvoiced stops, affric, frics = p, t, k, ch, f, th, s, sh

(none)

Type 75: all voiced = b, d, g, j, v, dh, z, zh, m, n, ng, l, r, w, y

1.  Does  John  believe  you  were  measuring  the  gun? 

2.  Your  brother's  vision  was  gradually  dimming. 

3.  The  regular  division  was  led  by  a young  major. 

4.  I gather  you  will  be  abandoning  the  major  revisions? 

5.  The  young  major's  evasions  were  growing  bolder. 

Type 76: all unvoiced = p, t, k, ch, f, th, s, sh, h

1.  I hope  she  chased  her  fox  to  earth. 

2.  A thickset  officer  pitched  out  her  hash. 

3.  He  checked  through  fifty  ships. 

4.  She  swiftly  passed  a health  check. 

5.  He  steps  off  a path  to  cash  a check. 

Type 80: Elliptic sentences (all except h, ch, j, y), from
Miller, G. A. & Nicely, P. A., An analysis of perceptual
confusions among some English consonants,
J. Acoust. Soc. Amer. 27, 1955, 338-352.

1.  The  wealthy  banker  from  Persia  should  be  a good  citizen. 

2.  The  issue  of  McCarthy  is  forcing  a great  division  among 
Republicans. 

3.  Division  can  be  a fast  operation  with  logarithms. 

4.  She  thinks  she  bought  some  good  rouge  and  lipstick  from  one 
of  these  men. 

Type  81:  All  consonants. 

1.  If  the  treasure  vans  got  so  much  publicity,  we  think  you 
should  hide  your  share. 

2.  The  voyagers  have  ground  the  crankshaft  with  (th) 
unimpeachable  precision. 

3.  The  old-fashioned  jacket  was  giving  you  both  so  much  humorous 
pleasure. 

4.  Disillusioned  taxpayers  think  the  average  gambler  half  wishes 
to  cheat. 

5.  The  average  disillusioned  gambler  thinks  he  wishes  for  a 
cheap  yacht. 

6.  Nothing  could  be  further  from  reality  than  his  illusion  of 
chasing  your  gorgeous  sheep  away. 

7.  She  thinks  even  the  pale  rouge  you  bought  was  much  too  gaudy 
for  her  age. 


(Paper presented at the IEEE International Conference on Acoustics, Speech,
and Signal Processing, Hartford, CT, 1977.)


QUALITY RATINGS OF LPC VOCODERS: EFFECTS OF
NUMBER OF POLES, QUANTIZATION, AND FRAME RATE


A.W.F. Huggins, R. Viswanathan, and J. Makhoul


Bolt Beranek and Newman Inc.

50 Moulton Street, Cambridge, Mass. 02138


Four values for number of poles (13,
11, 9, 8) were combined factorially with
three values of step size for quantization
of log area ratios (0.5, 1, 2 dB), and
with four values of frame rate (100, 67,
50, 33 per second), to define 48 LPC
vocoder systems with overall bit rates
ranging from 8.7 down to 1.3 kbps.
Subjects  rated  the  DEGRADATION  of  signal 
quality  by  each  vocoder,  for  each  of  seven 
sentence  tokens,  chosen  to  challenge  LPC 
vocoders  maximally.  The  results  define 
the  combination  of  LPC  parameters  yielding 
the  best  speech  quality  for  any  desired 
overall  bit  rate. 


This  study  was  performed  to  measure 
how  the  quality  of  LPC  vocoded  speech  is 
affected  by  three  different  methods  of 
reducing  bit  rate.  These  were: 

1)  reducing  the  number  of  poles  used  for 
spectral  matching, 

2) coarsening the step size used in
quantizing the coefficients (log area
ratios, Viswanathan & Makhoul, 1975),

3)  reducing  the  number  of  frames  of 
coefficients  transmitted  per  second. 

To establish the best operating point for a range of different bit
rates, it is necessary to perform a factorial study, in which each
value of a parameter occurs with every combination of values of the
other parameters. We used the following set of parameter values:
Number of Poles, P: 13, 11, 9, or 8; Quantization Step Size, Q: 0.5,
1.0, or 2.0 dB; and Frame Rate, R: 100, 67, 50, or 33 per second,
yielding 48 LPC systems (4x3x4). Two additional systems were
included. One was an LPC system with 13 poles, quantization step size
of 0.25 dB, and transmission rate of 100 frames per second. The other
consisted of PCM speech at 110 kbps (i.e. the waveform sampled at
10 kHz and quantized to 11 bits), to act as an undegraded anchor. The
bits per frame for each combination of number of poles and
quantization step size appear in Table 1.


Quantization        No. of Poles
Step Size           13    11    9

0.25 dB
0.5 dB
1.0 dB
2.0 dB

Table 1: Bits per frame for all combinations of number of poles and
quantization step size used in the present study (excluding pitch and
gain).


Pitch and gain were transmitted at the same frame rate as the
coefficients. The overall bit rate for any system is calculated by
adding 6 bits of pitch coding and 5 bits of gain to the bits per
frame, and multiplying by the appropriate frame rate. The overall bit
rate of the LPC systems ranged from 8700 bps (P = 13, Q = 0.25 dB,
R = 100/sec), down to 1267 bps (P = 8, Q = 2.0 dB, R = 33/sec). Note
that these rates do not include the benefits of Huffman coding, in
which the most frequently used values are assigned the shortest
codes. This procedure can further reduce bit rates by about 20%, with
absolutely no change in the coefficient values transmitted (Makhoul
et al., 1974).
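
The bit-rate rule just described can be sketched in a few lines. This is only an illustration: the function names are ours, and the coefficient bits per frame recovered below are inferred from the two quoted endpoint rates, not read from Table 1. The nominal frame rate of 33 per second is assumed here to be 100/3.

```python
PITCH_BITS = 6  # pitch coding bits per frame, as stated in the text
GAIN_BITS = 5   # gain bits per frame, as stated in the text


def overall_bit_rate(coeff_bits_per_frame, frames_per_sec):
    """Overall rate in bits per second: (coefficient + pitch + gain bits) x frame rate."""
    return (coeff_bits_per_frame + PITCH_BITS + GAIN_BITS) * frames_per_sec


def coeff_bits_from_rate(bit_rate, frames_per_sec):
    """Invert the rule above to recover the coefficient bits per frame."""
    return bit_rate / frames_per_sec - PITCH_BITS - GAIN_BITS


# Endpoints quoted in the text; 33/sec is ASSUMED to mean 100/3 frames per second.
high_end_bits = coeff_bits_from_rate(8700, 100)     # P = 13, Q = 0.25 dB
low_end_bits = coeff_bits_from_rate(1267, 100 / 3)  # P = 8,  Q = 2.0 dB
print(high_end_bits, round(low_end_bits, 2))
```

Under these assumptions the quoted endpoints correspond to roughly 76 and 27 coefficient bits per frame.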


Our earlier subjective quality tests showed the necessity of passing
all sentence materials through all systems (Huggins & Nickerson,
1975). Other researchers have reached similar conclusions (Pachl et
al., 1971). In our earlier tests, we developed a set of six
sentences, each read by six talkers, that was both representative, in
that it covered a wide range of speech events and talker
characteristics, and also challenging, in that some speech material
was included that would fully extend any LPC vocoder's abilities.
Unfortunately, we could not use all 36 speaker-sentence combinations
in the present study, since passing them through all 50 vocoder
systems would have made the study unmanageably large. We therefore
selected a subset of seven speaker-sentence combinations, and
confirmed that they were adequately representative of the full set by
repeating the MDPREF analysis using just the data from the subset.

The subset of sentence tokens that was selected consisted of: JB1,
DD2, RS3, AR4, JB5, DK6, and RS6, where the initials identify the
speaker and the number identifies the sentence. Relevant details of
the sentences, and of the speakers' voices, are given in Table 2.


ID     F0    Sentence

JB1    119   Why were you away a year, Roy?
DD2    134   Nanny may know my meaning.
RS3    195   His vicious father has seizures.
AR4    165   Which tea-party did Baker go to?
JB5    124   The little blankets lay around on the floor.
DK6     97   The trouble with swimming is that you can drown.
RS6    193   The trouble with swimming is that you can drown.

Table 2: The seven stimulus sentences, with the speaker's average
fundamental frequency in Hz.


3.  Generation  of  Stimulus  Tapes 

Each of the seven input sentences was digitized (11 bits, 10 kHz),
and passed through each of the 50 simulated vocoder systems, to yield
a total of 350 different stimulus items.

Earlier studies have demonstrated that a subject's judgment,
especially of speech stimuli, can be strongly affected by the
preceding stimulus (e.g. Huggins, 1968). It is important to control
for effects such as this by counterbalancing the presentation order.
A complete counterbalancing of the 50 vocoder systems was generated,
in which every system followed every other system once, with
independent approximate counterbalancing of the sentences. This
required only seven passes through the 350 stimuli, and had the
further advantage that even within each pass, all ranges of contrast
between successive stimuli occurred equally often, so that no severe
departures from balance occurred even within one pass. The sequence
was generated by a trial and error search, following an algorithm
described by Williams (1950). No system and no sentence followed
itself.
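
The balance property described above (every system follows every other system exactly once) can be illustrated with a standard cyclic construction of a Williams-type Latin square for an even number of conditions. This is a sketch of that textbook construction only; it is not the trial and error search used to build the actual tapes, it ignores the sentence counterbalancing, and the function names are ours.

```python
from collections import Counter


def williams_square(n):
    """Rows of a cyclic Williams-type square for even n.

    The first row is 0, 1, n-1, 2, n-2, ...; each later row adds 1 (mod n).
    Across all rows, every ordered pair (a, b) with a != b is adjacent once.
    """
    if n % 2 != 0:
        raise ValueError("this simple construction needs an even n")
    first = [0]
    low, high = 1, n - 1
    for k in range(1, n):
        if k % 2 == 1:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
    return [[(x + shift) % n for x in first] for shift in range(n)]


def follow_counts(rows):
    """Count how often each ordered pair (previous, next) occurs within rows."""
    counts = Counter()
    for row in rows:
        for prev, nxt in zip(row, row[1:]):
            counts[(prev, nxt)] += 1
    return counts


rows = williams_square(50)  # 50 vocoder systems, as in the study
counts = follow_counts(rows)
print(len(counts), set(counts.values()))
```

With 50 conditions there are 50 x 49 = 2450 ordered pairs, the same number as seven passes through 350 stimuli.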

We tried to further reduce sequence effects (and thus improve the
reliability of the data) by a novel method. A continuous speech
babble, at the same level as the speech, was automatically faded in
and out again during the inter-stimulus interval. We hoped that, by
analogy with the "suffix" effect found in studies of auditory short
term memory (Crowder & Morton, 1969), the babble would interfere with
the memory trace of earlier stimuli, on which sequence effects
presumably depend. The babble was developed at BBN for other purposes
(Kalikow et al., 1976). The babble signal was recorded on a separate
track of the tape, to permit the signal to be played with or without
the babble.

Seven  experimental  tapes  were  then 
recorded.  Stimuli  were  presented  in 
blocks  of  ten,  at  a rate  of  one  every  7.5 
seconds,  with  a longer  gap  between  blocks. 


4. Experimental Procedures

The  subject's  task  was  to  rate  the 
degradation  of  the  stimuli  he  heard.  This 
negative  attribute  was  chosen  for  scaling, 
as  in  our  earlier  experiment,  because  the 
scale  has  a natural  origin,  or  zero, 
corresponding  to  undegraded  speech. 
Instead of assigning a number to his
judgment, the subject made his response by
making a mark on a 10 cm line on his
answer  sheet.  Two  visual  anchors  were 
provided  on  the  response  line.  The  left 
anchor  was  4 mm  from  the  left  end  of  the 
line,  and  was  marked  "PERFECT".  The  right 
anchor  was  1 cm  from  the  right  end  of  the 
line.  For  data  analysis,  the  response  was 
converted  into  the  distance  in  millimeters 
from  the  left  end  of  the  line  (not  the 
anchor)  to  the  subject's  mark  where  it 
crossed  the  response  line.  Thus  small 
numbers  correspond  to  high  quality,  and 
large  numbers  to  poor  quality. 

Nine  subjects  served  in  the 
experiment.  They  were  recruited  by  local 
university  summer  placement  offices:  all 
reported  having  normal  hearing.  Three  of 
the  subjects  made  the  first  five  passes 
through  the  350  stimuli,  and  six  more 
subjects  made  only  the  first  two  passes. 


5. Results

First,  to  check  on  the  reliability  of 
the  data,  the  responses  collected  on  each 
pair  of  passes  through  the  350  stimuli 
were  correlated,  for  each  subject.  All 
correlations  were  significant  well  beyond 
P<.001,  with  the  (Pearson  product-moment) 
coefficients  lying  between  0.48  and  0.83, 
almost  all  of  them  in  the  top  half  of  this 
range. 
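
As a reminder of what this reliability check computes, the following sketch evaluates a Pearson product-moment coefficient between two passes. The ratings shown are invented for illustration and are not data from the experiment; the function name is ours.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL ratings (mm along the response line) for six stimuli on two passes.
pass_1 = [12, 35, 60, 48, 20, 75]
pass_2 = [15, 30, 66, 40, 28, 70]
r = pearson_r(pass_1, pass_2)
print(round(r, 2))
```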

The mean degradation rating was calculated for each system, and these
are plotted as a function of overall bit rate in Figs. 1, 2, and 3.
Each system is identified by three digits, corresponding to the
parameter level for P, Q, and R, respectively. Thus system 231 used
level 2 of P (11), level 3 of Q (1.0 dB) and level 1 of R (100/sec),
as shown in the key to the figure. The means of the ratings (N.B. not
the ratings) have standard deviations of about 1.5 points. Therefore
any difference between two plotted means that is larger than about
3-4 points is probably significant at P<.05 by t-test. (In fact it is
likely that much smaller differences are also significant, since this
test does not partial out the variability due to the sentence and
subject. This can be done by comparing ratings on the pair of systems
of interest before pooling across subjects and sentences.)
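
The "3-4 points" rule of thumb can be checked with a short calculation. The sketch below assumes two independent means, each with a standard deviation of 1.5 points, and uses a normal approximation in place of the t distribution (the degrees of freedom are not given in the text). The function name is ours.

```python
import math
from statistics import NormalDist


def min_significant_difference(sd_of_mean, alpha=0.05, two_sided=True):
    """Smallest difference between two independent means that reaches alpha.

    Normal approximation: difference > z * sqrt(2) * sd_of_mean.
    """
    se_diff = math.sqrt(2) * sd_of_mean
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    z = NormalDist().inv_cdf(p)
    return z * se_diff


two_sided = min_significant_difference(1.5)                   # about 4.2 points
one_sided = min_significant_difference(1.5, two_sided=False)  # about 3.5 points
print(round(two_sided, 2), round(one_sided, 2))
```

Under these assumptions the threshold falls between roughly 3.5 and 4.2 points, in line with the range quoted above.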


Figure 1: Mean degradation rating vs. Bit
Rate for 48 LPC vocoders. Lines join
"best" systems for each No. of Poles.


In Figure 1, a line joins the "best" systems using 13 poles, and
other lines join the best systems using 11, 9, and 8 poles. From
inspection of Figure 1, it is clear that 13-pole systems give
(slightly) better quality than 11-pole systems for most bit rates
above 2750 bps. 11-pole systems are (slightly) superior between about
1500 bps and 2750 bps. These differences are small, however, and are
probably not significant. The best 11 and 13 pole systems are
substantially better than the best 8 or 9 pole systems at comparable
bit rates. These differences are large and highly reliable. The
reason is that there is a highly significant interaction between the
sex of the talker (or the talker's fundamental frequency) and the
number of poles. This confirms earlier findings (Huggins & Nickerson,
1975; Huggins et al., 1976). Averaging ratings across all systems
with the same number of poles shows that reducing the number of poles
from 13 to 8 had relatively little effect on quality for the three
sentences spoken by females (RS3, AR4, RS6), whereas there is a
massive reduction of quality for male voices when the number of poles
is reduced below 11.


Figure 2: Degradation vs. Bit Rate. Lines
join "best" systems for each Quantization
Step Size.


Figures 2 and 3 present comparable plots, with best systems joined
for each level of quantization, and for each level of frame rate,
respectively. The differences in quality between different levels of
quantization, at a given bit rate, are significant only at the very
low bit rates. Here, quality is less affected by coarsening
quantization than by reducing the number of poles.


Figure 3: Degradation vs. Bit Rate. Lines
join "best" systems for each Frame Rate.


Figure 3 shows that below 4.5 kbps, quality can be substantially
improved, at no extra cost in bit rate, by reducing the frame rate
and increasing the number of bits per frame, that is, by improving
"static" spectral accuracy at the expense of "dynamic" spectral
accuracy. Most of these quality differences, due to changing frame
rate without changing overall rate, are highly significant. The size
of the effect of frame rate lends further support to our earlier
result (Huggins et al., 1976), suggesting that a well designed
variable frame rate transmission scheme should yield substantial
savings in bit rate without appreciable loss of quality.
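
The trade described here, fewer frames per second in exchange for more bits per frame at the same overall rate, can be made concrete with a small sketch. The 3000 bps budget is an arbitrary illustrative figure, not a system from the study, and the function name is ours; the 11 bits of pitch and gain per frame follow the rule stated earlier in the paper.

```python
PITCH_GAIN_BITS = 11  # 6 bits of pitch + 5 bits of gain per frame


def coeff_bits_available(bit_rate, frames_per_sec):
    """Bits per frame left for coefficients at a fixed overall bit rate."""
    return bit_rate / frames_per_sec - PITCH_GAIN_BITS


# HYPOTHETICAL budget of 3000 bps, evaluated at the four frame rates studied.
budget = 3000
for frame_rate in (100, 67, 50, 33):
    print(frame_rate, round(coeff_bits_available(budget, frame_rate), 1))
```

Halving the frame rate more than doubles the bits available for the spectral coefficients, because the fixed pitch and gain cost is paid less often.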


Further analyses of these data,
including multidimensional analysis, will
be reported in a separate paper.

APPENDIX 10

SPEECH-QUALITY TESTING OF SOME VARIABLE-FRAME-RATE (VFR)
LINEAR-PREDICTIVE (LPC) VOCODERS


(Paper  published  in  the  Journal  of  the  Acoustical  Society 
of  America,  Vol.  62,  August  1977.) 


Speech-quality testing of some variable-frame-rate (VFR)
linear-predictive (LPC) vocoders


A. W. F. Huggins, R. Viswanathan, and J. Makhoul
Bolt Beranek and Newman Incorporated, 50 Moulton Street, Cambridge, Massachusetts 02138
(Received 23 February 1977; revised 23 April 1977)

VFR transmission of LPC vocoder coefficients is a technique developed to reduce the average transmission
rate without appreciable loss of quality. The technique transmits parameters at a variable rate in
accordance with the changing characteristics of the speech signal. In order to assess the effectiveness of
VFR transmission, we performed an experiment to compare it with two other methods for reducing bit
rate: (a) reducing the number of poles, and (b) increasing the quantization step size of the LPC
coefficients (log area ratios). Thirty-two stimulus sentences were prepared by passing four utterances (2
sentences x 2 speakers) through eight vocoder systems in a 2x2x2 factorial design; two values were
assigned to each of the three parameters: average frame rate, number of poles, and quantization step size.
Eight listeners made category ratings of quality degradation. The results of the experiment
show that, of the three methods studied, the VFR technique produced the highest quality at any given
transmission rate (or, equivalently, yielded the lowest bit rate for a fixed level of speech quality).

FACS  numbera:  43.7aLar,  43.706^  43.7(Ut 


INTRODUCTION 

Even cursory inspection of spectrograms of speech shows that the rate
at which the short-term spectrum changes can vary over a wide range.
In stressed vowels, or in strident fricatives (s, sh, z, zh), the
spectrum may change very little over periods as long as 150-200 msec.
During transitions between acoustically different segments, on the
other hand, the spectrum may change very rapidly. Variable frame rate
(VFR) LPC vocoders take advantage of this variability to reduce their
average bit rate. In a typical system, the power spectrum of a
20-msec interval of input speech (a "frame") is modeled every 10
msec, and whenever the spectrum is changing rapidly, every frame that
is analyzed is also transmitted. During slowly changing parts of the
signal, however, a frame is not transmitted unless it is different
from the preceding transmitted frame by more than some threshold.
Preliminary tests suggested that VFR transmission could reduce the
frame rate to an average of 35 per sec or less, with negligible loss
of speech quality (Makhoul et al., 1974). Such a VFR system could
operate directly, without any interface, over a time-asynchronous or
variable-rate channel, such as the ARPANET. For use over fixed-rate
channels, the VFR system must be interfaced to the channel through a
tandem of transmit and receive buffers, with associated data-flow
control. This introduces additional delay into the transmission path,
but recent work (Blackman et al., 1977) has shown that
variable-to-fixed rate conversion can be achieved with negligible
loss of quality even for delays as short as 80 msec.
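
The frame-selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: a plain Euclidean distance between parameter vectors stands in for whatever spectral difference measure a real system would use, the frame data are invented, and the function names are ours.

```python
import math


def spectral_distance(frame_a, frame_b):
    """Stand-in distance between two parameter vectors (ASSUMED Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))


def select_frames(frames, threshold):
    """Return indices of frames that a VFR scheme would transmit.

    A frame is sent only if it differs from the last TRANSMITTED frame
    by more than the threshold; the first frame is always sent.
    """
    sent = []
    last_sent = None
    for index, frame in enumerate(frames):
        if last_sent is None or spectral_distance(frame, last_sent) > threshold:
            sent.append(index)
            last_sent = frame
    return sent


# HYPOTHETICAL two-parameter frames, one per 10-msec analysis step.
frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [2.0, 1.0], [2.1, 1.0], [4.0, 3.0]]
sent = select_frames(frames, threshold=1.0)
average_rate = 100 * len(sent) / len(frames)  # frames per sec at 100 analyses per sec
print(sent, average_rate)
```

With a threshold of zero every frame that differs at all from its predecessor is transmitted, which reduces the scheme to fixed-rate operation.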

The experiment to be described was performed to follow up an
unexpected result in an earlier study, in which advantages much
smaller than expected were found for VFR transmission. The purpose of
the earlier study (Huggins and Nickerson, 1975) was to develop a
small set of speech materials, for use in quality rating studies of
LPC vocoders. We argued that LPC vocoding can introduce discrepancies
between the input and reconstituted speech in several distinct ways,
and showed that these produced different effects on perceived
quality. LPC vocoding starts by modeling the spectrum of a short
waveform interval (e.g., 20 msec) as the response of an all-pole
filter. The more coefficients or poles that can be used to define the
filter, the more closely the modeled spectrum can approach the speech
spectrum. If too few coefficients are used, detail in the speech
spectrum is effectively discarded, and cannot thereafter be
recovered. Further losses occur as the LPC coefficients are quantized
for transmission. Some of the spectral accuracy lost during
quantization may be recovered during resynthesis by appropriate
smoothing and interpolation algorithms. The foregoing two processes
limit the spectral accuracy that can be achieved for a single frame
of speech. We have called this "static spectral accuracy."

Each frame of quantized coefficients represents the input-speech
spectrum at a particular instant of time. The smaller the intervals
between successive analysis frames, the larger the maximum rate of
spectral change that can be accurately retained in the reconstituted
speech. Thus the frame-analysis interval controls "dynamic spectral
accuracy."

The speech materials we developed attempted to target these sources
of spectral errors by concentrating within single sentences all
phonemes having similar acoustic properties, as shown in Table I. The
results of the experiment, which we describe in more detail below,
suggest that our attempt was successful. Subjects judging the quality
of these sentences, as processed by a variety of vocoders, are in
effect able to compare the vocoders with respect to a single source
of degradation at a time, which greatly simplifies their task.

TABLE I. Test sentences.

Phoneme-specific sentences

(1) Why were you away a year, Roy?
(2) Nanny may know my meaning.
(3) His vicious father has seizures.
(4) Which tea-party did Baker go to?

General sentences

(5) The little blankets lay around on the floor.
(6) The trouble with swimming is that you can drown.


In the earlier experiment, the sentences shown in Table I were
recorded by 20 speakers, from which a subset of three males and three
females were chosen, such that the full range of speaker
characteristics found in the group of twenty was retained. The
resulting 36 test sentences (6 sentences x 6 speakers) were processed
by a set of twelve simulated vocoders, which used log-area ratios as
the transmission parameters. (For the many desirable properties of
log-area ratios, see Viswanathan and Makhoul, 1975.) After choosing
the number of poles (13, 11, or 9), and frame rate, the quantization
step size of each system was chosen so as to equate the bit rates of
all twelve systems at 3600 bits per sec. Quantization step size
varied between 0.3 and 1.75 dB.

Seven of the systems used fixed transmission rates of 67, 90, or 40
frames per sec, and the remaining five were VFR systems with average
frame rates between 47 and 91 per sec. Pitch and gain were coded in
11 bits, and were transmitted at the frame rate used for the
coefficients in the fixed rate systems, but at a fixed rate of 50 per
sec for the VFR systems, to avoid confounding excitation and spectral
variables. In the VFR systems, the input speech was analyzed every 10
msec, but the resulting data frame was not transmitted unless the
spectral difference between it and the preceding transmitted frame
exceeded a threshold. The spectral difference was measured using a
log-likelihood ratio measure (Itakura, 1975; Makhoul et al., 1974),
and thresholds between 1.0 and 1.75 dB were used. Therefore, frames
were sent every 10 msec during rapidly changing parts of the speech,
but as seldom as every 60 msec during slowly changing portions. For
each of the five VFR systems, the parameter values were chosen so
that the average transmission rate over all 36 test sentences was
about 3600 bits per sec. The waveform was low-pass filtered at 5 kHz,
sampled at 10 kHz, and preemphasized by differencing, before
processing through the vocoders.
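
Preemphasis by differencing, as mentioned above, amounts to replacing each sample by its difference from the preceding one. The sketch below assumes a first difference with coefficient 1.0 and an implied zero before the first sample; the text does not state these details, and the function name is ours.

```python
def preemphasize(samples, coeff=1.0):
    """First-difference preemphasis: y[n] = x[n] - coeff * x[n-1].

    coeff = 1.0 gives plain differencing (ASSUMED here); the sample
    before the start of the signal is taken to be zero.
    """
    output = []
    previous = 0.0
    for x in samples:
        output.append(x - coeff * previous)
        previous = x
    return output


# A constant signal is flattened to zero after the first sample,
# while a steadily rising one becomes a constant.
print(preemphasize([1, 1, 1, 1]))
print(preemphasize([0, 1, 3, 6]))
```

Differencing attenuates low frequencies relative to high ones, which is why it is commonly applied before linear prediction analysis.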

Subjects rated the degradation of speech quality in each of the 36
stimulus sentences as processed by each of the 12 vocoders. Mean
ratings were analyzed by a multidimensional scaling program (MDPREF,
see Carroll, 1973), which represents the vocoder systems as points in
an N-dimensional space (three dimensional, in our case), and each
speaker-sentence combination as a vector through the space. The
performance of each vocoder on a particular speaker-sentence
combination is represented by the projection of the point
representing the system onto the vector representing the stimulus
sentence.
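
The geometric idea in this description, a system's score on a sentence being the projection of its point onto the sentence's vector, can be sketched directly. The coordinates below are invented for illustration and are not MDPREF output; the function name is ours.

```python
import math


def projection(point, direction):
    """Signed length of the projection of `point` onto `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(p * d for p, d in zip(point, direction)) / norm


# HYPOTHETICAL three-dimensional coordinates for two vocoder systems
# and one speaker-sentence vector.
system_a = [1.0, 2.0, 3.0]
system_b = [3.0, 0.5, 1.0]
sentence_vector = [0.0, 0.0, 2.0]

score_a = projection(system_a, sentence_vector)
score_b = projection(system_b, sentence_vector)
print(score_a, score_b)
```

Ordering the systems by their projections on a given sentence vector gives the fitted ordering of the systems for that speaker-sentence combination.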

The results showed a clear separation of the systems (1) as a
function of the number of poles, and (2) as a function of the
frame-analysis interval, of the vocoders. Furthermore, the separation
along these two dimensions was orthogonal, suggesting that the
perceptual effect of changing the number of poles ("static" spectral
accuracy) was independent of the perceptual effect of changing the
frame-analysis interval ("dynamic" spectral accuracy). The
orientation of the test-sentence vectors in the space showed that the
separation of the fixed-rate systems by frame-analysis interval was
achieved as a result of the specially composed sentence materials
(Table I), with the short analysis-interval systems performing better
on the rapidly changing sentence [see sentence (4)], and the long
analysis-interval systems, with more bits per frame, doing better on
the slowly varying sentences [(1) and (3)]. The VFR systems were
located correctly for their frame-analysis intervals of 10 msec.
Further evidence that our sentences differed in rate of spectral
change, as required, is provided by measurements of the average frame
rates across the five VFR systems. The average VFR rate was lowest
for each of the six speakers in sentences (1) and (3), and highest in
sentence (4). Separation of the vocoders as a function of the number
of poles resulted from the use of the different talkers, with the
relative performance of systems with 13, 11, and 9 poles on a
particular sentence being highly correlated with the mean fundamental
frequency in the sentence. Nine-pole systems performed almost as well
as 11- or 13-pole systems on high-pitched sentences, but much worse
on low-pitched sentences.

The five VFR systems included in this study performed less well than
expected. Although they did perform better than fixed rate systems on
the rapidly changing sentences [sentences (2) and (4) in Table I],
they performed worse than some of the fixed-rate systems on the
slowly changing sentences, sentences (1) and (3), and about equally
well on the general sentences, sentences (5) and (6). On the other
hand, the average frame rate of the VFR systems was higher than that
of the fixed-rate systems during the rapidly changing sentences, and
lower during the slowly changing sentences, which may partly account
for the observed performance. At the same time, the large expected
advantages of the VFR systems did not appear, and the experiment
described below was performed specifically to establish that they do,
in fact, occur.

I. PROCEDURE

Equating the bit rates of all vocoders in the earlier study meant
that any pair of vocoders differed in at least two parameter
values, which made direct comparisons difficult. Therefore, for
the present study, which had the explicit aim of comparing
systems, we adopted a factorial design, in which two values of
each of the three parameters occurred in every possible
combination. Details of the systems are shown in Table II. These
systems represent a much wider range of qualities than was used
in the earlier study.

Each system used either 11 or 8 poles. The log-area ratio
coefficients were quantised in steps of either 0.5 or 2.0 dB
(Viswanathan and Makhoul, 1975). LPC analysis of the speech
signal was carried out at 50 frames per sec, and the threshold of
the VFR scheme was set to either 0 dB, in which case every
analysed frame was transmitted, yielding a fixed frame rate of
50 per sec, or 2.5 dB, which resulted in a variable frame rate
that averaged 23.3 per sec. Note that 2.5 dB represents a very
coarse threshold, and that the resulting average frame rate is
less than 60% of the average frame rate of the VFR systems in the
earlier study, over the same sentences. Pitch and gain were coded
in 11 bits and transmitted at a constant rate of 50 frames per
sec, as in the VFR systems in the earlier study.

TABLE II. System parameters for the eight vocoders.

System          Quant   VFR thresh   Rate      Bits
I.D.    PQR   Poles   (dB)    (dB)         (F/sec)   per sec
------  ----  ------  ------  -----------  --------  -------
A       000   11      0.5     0            50        3157
B       001   11      0.5     2.5          23        1831
C       010   11      2.0     0            50        2155
D       011   11      2.0     2.5          23        1348
E       100    8      0.5     0            50        2521
F       101    8      0.5     2.5          23        1458
G       110    8      2.0     0            50        1771
H       111    8      2.0     2.5          23        1118
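The factorial design of Table II can be enumerated mechanically.
The sketch below is illustrative only: the function and variable
names are our own, and the mapping of each PQR bit to a parameter
value is read off Table II. It also lists the pairs of systems
that differ in exactly one parameter, which are the pairs that
can be compared directly.

```python
from itertools import product

# Bit value 0 selects the higher-quality setting, 1 the cheaper one
# (mapping read off Table II; the names here are illustrative).
POLES = (11, 8)                 # P bit
QUANT_STEP_DB = (0.5, 2.0)      # Q bit
VFR_THRESHOLD_DB = (0.0, 2.5)   # R bit


def factorial_design():
    """Return the eight systems A..H of the 2 x 2 x 2 design."""
    systems = {}
    for label, (p, q, r) in zip("ABCDEFGH", product((0, 1), repeat=3)):
        systems[label] = {
            "pqr": f"{p}{q}{r}",
            "poles": POLES[p],
            "quant_db": QUANT_STEP_DB[q],
            "vfr_db": VFR_THRESHOLD_DB[r],
        }
    return systems


def single_parameter_pairs(systems):
    """Return label pairs whose PQR codes differ in exactly one bit."""
    labels = sorted(systems)
    pairs = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            diff = sum(x != y for x, y in zip(systems[a]["pqr"],
                                              systems[b]["pqr"]))
            if diff == 1:
                pairs.append((a, b))
    return pairs


systems = factorial_design()
pairs = single_parameter_pairs(systems)
```

With three two-valued parameters there are eight systems and
twelve single-parameter pairs, four for each parameter.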

A subset of the 36 stimulus sentences used in the first study was
selected. To ensure that the subset was representative of the
whole set of 36, we chose the two "general" sentences from Table I
[i.e., sentences (5) and (6)], since between them these contain
(almost) all the phonemes of English. We eliminated the
phoneme-specific sentences, since they form a balanced set, and
choosing one of them would have entailed choosing the others as
well. We then selected two speakers, one male and one female,
such that the vectors corresponding to their productions of the
two general sentences were separated as widely as possible in the
MDPREF solution space of the earlier study. To confirm that these
four stimulus sentences were adequately representative, we
repeated the MDPREF analysis of the earlier study, using only the
subset of data collected on the four sentences. The solution
obtained was highly similar to the solution obtained with the
whole set of 36 sentences, and achieved the same orthogonal
separation of the systems by number of poles, and by frame rate.
This test confirmed that the selected subset was indeed
representative.

The four sentences were passed through the eight simulated
vocoders, and were recorded in two random orders on the stimulus
tape, with order of sequential presentation counterbalanced fully
across system pairs, and as far as possible across sentence
pairs, with the constraint that no system and no sentence should
follow itself. Eight subjects were then run individually through
two exact repetitions of the tape, although the subjects were not
aware of the repetition. Thus each subject made four ratings on
each of the 32 stimulus sentences. They rated the degradation of
what they heard on a seven-point scale, 1-7, with "overflow bins"
(0 and 8) at each end. That is, if a stimulus sounded appreciably
better than a previous one labeled with a "1", the subject was
allowed to use a "0" response.

II.  RESULTS  AND  DISCUSSION 

The mean ratings assigned to the eight systems are shown in
Fig. 1, where the ratings are plotted against overall bit rate
including pitch and gain. Lines join each pair of systems that
differ in only a single parameter: solid lines join all pairs of
systems that differ only in frame rate; dashed lines join pairs
of systems that differ only in the number of poles; and dotted
lines join pairs that differ only in quantisation step size.

Consider first the three lines leaving system A, at the upper
right hand corner of Fig. 1. For each parameter, system A has the
parameter value associated with better speech quality. Bit rate
can be reduced for this system in three ways: (1) by reducing the
number of poles, (2) by coarsening the quantisation, or (3) by
going to a VFR system. The figure shows that reducing the number
of poles resulted in the smallest savings in bits, accompanied by
a large loss of quality. Increasing the quantisation step size
yielded a slightly better rate of bits saved per unit quality
loss: more bits were saved, but at a cost of a slightly larger
reduction in quality. Both the largest savings in bits and the
smallest drop in quality were associated with the introduction of
the VFR scheme. Similar conclusions can be drawn from looking at
the gains in quality achieved by increasing the bit rate of the
worst system, system H. The smallest quality improvement, with
the largest cost in extra bits, was obtained by abandoning the
VFR scheme. For one pair of otherwise identical systems, going
from fixed to variable frame rate reduced the bit rate by about
40% with no effect on quality (see systems C and D in Fig. 1).
All but three of the quality differences, between pairs of
systems joined by lines, are extremely significant, that is, well
beyond the 0.001 level. The three exceptions were (1) the quality
difference between systems C and D, which was not significant;
(2) the difference between systems G and H, which just failed to
reach significance at the 0.05 level; and (3) the difference
between F and H, which was just significant (p < 0.05).
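The bit-rate side of these comparisons can be checked directly
from the "Bits per sec" column of Table II. The short sketch
below is only an arithmetic aid; the helper names are our own,
and the quality ratings themselves are not reproduced because
they are given only graphically in Fig. 1.

```python
# Average bit rates (bits per sec, including pitch and gain)
# copied from Table II.
BITS_PER_SEC = {
    "A": 3157, "B": 1831, "C": 2155, "D": 1348,
    "E": 2521, "F": 1458, "G": 1771, "H": 1118,
}


def saving(src, dst):
    """Bits per sec saved by moving from system src to system dst."""
    return BITS_PER_SEC[src] - BITS_PER_SEC[dst]


def percent_saving(src, dst):
    """Saving expressed as a percentage of the bit rate of src."""
    return 100.0 * saving(src, dst) / BITS_PER_SEC[src]


# The three single-parameter moves away from system A.
from_a = {
    "fewer poles (A -> E)": saving("A", "E"),
    "coarser quantisation (A -> C)": saving("A", "C"),
    "VFR (A -> B)": saving("A", "B"),
}

# Fixed to variable frame rate for the otherwise identical C and D.
c_to_d_percent = percent_saving("C", "D")
```

Starting from system A, the savings are 636, 1002, and 1326 bits
per sec for fewer poles, coarser quantisation, and VFR
respectively, which is the ordering described above. The move
from C to D saves roughly 37% of the bit rate, consistent with
the "about 40%" quoted in the text.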

Comparison of the variances of the judgments for pairs of systems
showed that two pairs of systems yielded significant variance
ratios. The system pairs are A and E, and B and F, both of which
differ only in the number of poles used. The quality judgments
for the speech passed through the 8-pole systems (E and F) had a
much broader variability; in fact the distribution was bimodal,
and Fig. 2 shows why. Here, in Figs. 2(a) and 2(b), the judgments
are broken down by speaker and by sentence. Judgments for all
8-pole systems are pooled, and appear at the left end of the
dotted lines. Those for 11-pole systems appear at the right end
of the dotted lines. Dashed lines show the mean effect of
quantisation and the solid lines show the mean effect of variable
frame rate.

FIG. 1. Mean degradation rating is plotted against average bit
rate (including pitch and gain), for each of the eight LPC
vocoder systems tested. See text for more details.

FIG. 2. The effect of speaker, sentence, and vocoder parameter on
speech quality. Mean degradation ratings are plotted against mean
bit rates (a) for each of the two speakers; (b) for each of the
two sentences; and (c) averaged across speakers and sentences.
The solid lines connect the points representing the averages for
the four systems with fixed frame rate (50 frames per sec) with
the points representing the average for the four systems with
variable frame rate (23 frames per sec). Similarly, the dashed
lines join the means for the four systems using 0.5 dB
quantisation with those using 2.0 dB. The dotted lines join the
means for all the 11-pole systems with those for all the 8-pole
systems.

Figure 2(a) shows that there is a strong interaction between the
speaker and the effect of number of poles. The male speaker's
speech was severely degraded by the 8-pole systems, whereas the
female speaker's speech was little affected. In fact, for the
female speaker, reducing the number of poles yielded a rate of
quality decline per bit saved no greater than that obtained by
adopting VFR transmission. This finding corroborates a result of
our earlier study (Huggins and Nickerson, 1975), in which we
found a strong interaction between the vocoder and the talker's
voice in determining speech quality. The relative speech quality
of systems using 13, 11, and 9 poles on a particular sentence was
highly dependent on the mean fundamental frequency in the test
sentence. However, it is likely that the critical variable is not
the fundamental frequency, but rather the length of the speaker's
vocal tract, which tends to correlate highly with fundamental
(large men have low voices). A speaker with a long vocal tract
has lower-frequency formants than one with a short vocal tract,
so there may be more formants to be modeled within the 5-kHz
passband of the vocoder. To separate the effects of fundamental
frequency from those of vocal-tract length, one would have to
repeat the experiment, using materials that held one constant
while varying the other. For example, tract length could be held
constant while fundamental was varied, by having single speakers
produce each sentence several times, at widely differing pitches.
Tract length could be varied, with fundamental held constant, by
having several speakers, with widely different tract lengths, all
produce a sentence at the same fundamental.
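The vocal-tract argument can be made concrete with a rough
sketch. It assumes, as an idealization that is not taken from the
study, that the vocal tract is a uniform lossless tube closed at
the glottis and open at the lips, so that its resonances fall at
odd multiples of c/4L. The speed of sound (350 m/sec) and the two
tract lengths are illustrative values only, and the function
names are our own.

```python
SPEED_OF_SOUND_M_PER_S = 350.0  # assumed value for warm, moist air


def tube_formants_hz(tract_length_m, passband_hz):
    """Resonances of a uniform quarter-wave tube below passband_hz.

    F_n = (2n - 1) * c / (4 * L), n = 1, 2, 3, ...
    """
    base = SPEED_OF_SOUND_M_PER_S / (4.0 * tract_length_m)
    formants = []
    n = 1
    while (2 * n - 1) * base < passband_hz:
        formants.append((2 * n - 1) * base)
        n += 1
    return formants


# Hypothetical long and short vocal tracts, 5-kHz passband.
long_tract = tube_formants_hz(0.175, 5000.0)
short_tract = tube_formants_hz(0.140, 5000.0)
```

Under these assumptions the longer tract places five resonances
(near 500, 1500, 2500, 3500, and 4500 Hz) inside the passband,
and the shorter tract only four. Since an all-pole model needs
roughly one complex pole pair per formant, a reduction from 11 to
8 poles would be expected to hurt the long-tract speaker more,
which is in the direction of the result reported above.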

III.  CONCLUSIONS 

Our results confirm that VFR transmission can yield substantial
savings in bit rate, with only minor loss of quality. The rate of
bits saved, per unit quality loss, is highest for savings
achieved by VFR transmission, and lowest for those achieved by
reducing the number of poles used in spectral modeling, at least
for the parameter values studied here. Secondly, there are major
interactions between perceived speech quality and the fundamental
frequency of the talker, for some systems.

During the course of this research, it has become clear that the
method used to decide whether or not the current data frame
should be transmitted is of paramount importance in maintaining
good speech quality. The log-likelihood ratio method we used was
very simple to implement, and it performed well in rapidly
changing sentences, but did not seem to work well in slowly
changing sentences. We have recently developed a new VFR scheme
(Viswanathan et al., 1977), in which log-area ratios are used
directly in deciding which frames to transmit, and which
explicitly takes into account the linear interpolation performed
at the receiver to approximate the coefficients in the frames
whose transmission is suppressed. Thus it is sensitive to
spectral errors that arise anywhere between two transmitted
frames, rather than considering only the end points. This work
has demonstrated good quality transmission with average frame
rates as low as 26 per sec (and as low as 18 per sec on the
slowly changing sentences). Informal listening tests showed that
the speech transmitted at 26 frames per sec by the new method was
of better quality than that transmitted at 37 frames per sec by
the likelihood ratio method.
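The idea of testing every suppressed frame against the
receiver's linear interpolation can be sketched as follows. This
is a minimal greedy illustration under our own assumptions, not
the scheme of Viswanathan et al.: the error measure (largest
absolute log-area-ratio deviation, in dB), the greedy search, and
all names are hypothetical.

```python
from typing import List, Sequence


def interpolation_error(frames: Sequence[Sequence[float]],
                        start: int, end: int) -> float:
    """Largest deviation of any skipped frame from the straight line
    drawn between frames[start] and frames[end]."""
    worst = 0.0
    span = end - start
    for k in range(start + 1, end):
        t = (k - start) / span
        for a, b, actual in zip(frames[start], frames[end], frames[k]):
            interpolated = (1.0 - t) * a + t * b
            worst = max(worst, abs(actual - interpolated))
    return worst


def select_frames(frames: Sequence[Sequence[float]],
                  threshold_db: float) -> List[int]:
    """Return indices of frames to transmit.

    Starting from the last transmitted frame, the gap is extended
    for as long as every skipped frame stays within threshold_db of
    the interpolated value.
    """
    if not frames:
        return []
    sent = [0]
    last = 0
    while last < len(frames) - 1:
        end = last + 1
        while (end + 1 < len(frames)
               and interpolation_error(frames, last, end + 1) <= threshold_db):
            end += 1
        sent.append(end)
        last = end
    return sent


# One parameter per frame: a steady ramp followed by a jump.
ramp_then_jump = [[0.0], [1.0], [2.0], [3.0], [10.0]]
chosen = select_frames(ramp_then_jump, threshold_db=0.5)
```

In this toy input the ramp is reproduced exactly by
interpolation, so only its end points and the frame after the
jump are selected. A rule that compared only the end points of a
gap would not notice an excursion in the middle of it, which is
the weakness the text attributes to the earlier method.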


a) A condensed version of this paper was presented at the 92nd
meeting of the Acoustical Society of America, San Diego,
California, Nov 18-19, 1976. The research was supported by the
Information Processing Techniques branch of the Advanced Research
Projects Agency.

These lists were developed by K. N. Stevens (1962a, b), and have
never been published before in their entirety. We thank Professor
Stevens for permission to include them here.

K. N. Stevens, M. H. L. Hecker and K. D. Kryter, "An Evaluation
of Speech Compression Systems," BBN Report No. 914, March 1962a.

K. N. Stevens, "Simplified Nonsense-Syllable Tests for Analytic
Evaluation of Speech Transmission Systems," J. Acoust. Soc.
Amer., p. 729, May 1962b.

While there exist methods in the literature for objectively
evaluating the intelligibility of speech in the presence of
stationary noise, little has been done regarding the objective
evaluation of either the intelligibility or the quality of
vocoded speech. We present a framework within which we have begun
a step-by-step program to develop objective measures for vocoded
speech quality that are consistent with results from subjective
tests.

1.  Introduction 

The ultimate criterion for determining the quality of the speech
that is produced by any compression, encoding or transmission
system is the way it sounds to the human listener. Although there
are well established procedures to test the intelligibility of
speech, little work has been done in developing procedures to
test speech quality, and in particular vocoder speech quality.
The few procedures that are available are subjective and require
extensive testing with human listeners, which is expensive in
terms of both time and money.

It would be desirable to develop objective procedures for speech
quality evaluation that correlate well with the scores obtained
from subjective listening tests. These objective measures would
ensure uniformity in evaluation as well as enable the evaluation
to be done by computer. Also, the measures can be used in the
design of better quality vocoders. While there exist methods in
the literature for objectively evaluating the intelligibility of
speech in the presence of stationary noise [1,2], little has been
done regarding the objective evaluation of either the
intelligibility or the quality of vocoded speech. The problem is
that if one regards the distortion in the vocoded speech signal
as noise superimposed on the signal, then this noise is not only
nonstationary but is correlated with the signal. This makes the
problem of objective evaluation of vocoded speech quality a
difficult one. However, given the immense long-term benefits in
terms of time and expense, any headway into the solution of the
problem is desirable.

This  paper  presents  a framework  within 
which  we  have  begun  a step-by-step  program  to 
develop  objective  measures  of  vocoded  speech 
quality  that  are  consistent  with  results  from 
subjective  tests. 

2. Necessary Conditions

Let s(n) be the original speech signal and s'(n) a vocoded
version of the same signal. Our aim is to develop measures that
compare the quality of s'(n) relative to s(n). Note that the
formulation of this problem is different from that of the
objective evaluation of speech intelligibility in the presence of
noise. In the latter, the noise spectrum is assumed stationary
and can be measured directly. The resulting objective
intelligibility scores are obtained by comparing the average
signal spectrum to the noise spectrum [1,2]. The same procedure
cannot be applied in the case of vocoded speech because the
"noise" that corrupts the signal is not well defined, and in any
case not easily measured. Even if the latter were possible, such
noise cannot be considered stationary and it is also correlated
with the signal. Therefore, one must somehow compare the vocoded
signal s'(n) to the original signal s(n).

One of the main problems in comparing s'(n) to s(n) is that of
time synchronization, so that corresponding segments of the two
signals can be compared. However, assuming that somehow one is
able to align the two signals, the problem of comparing s'(n) to
s(n) remains.

In many communication systems, the average mean squared
difference error between two signals is taken as a measure of
distance or deviation between the two signals. It is simple to
show that such an error measure cannot be a measure of the
difference in quality between the two signals. This is done by
offering a counterexample. Let s(n) be the input to an all-pass
filter, and let s'(n) be its output. The filter can be designed
such that the wave shape of s'(n) is quite different from s(n),
and such that the mean squared difference between s'(n) and s(n)
is large. However, we know from perceptual experiments that, in
all likelihood, the difference between s'(n) and s(n) is
insignificant as judged by a human listener (at least for vocoder
purposes). In fact, it is well known that, except for pitch,
phase information is quite irrelevant to the perception of speech
[3]. It is difficult to imagine an error criterion on the
waveform which would be insensitive to phase.
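The all-pass counterexample can be reproduced numerically. The
sketch below is an illustration under our own assumptions: a
first-order all-pass response with coefficient 0.9, a test signal
made of two sinusoids, and circular filtering through the DFT so
that the magnitude spectra of the finite segments match exactly.
None of these choices comes from the text.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def allpass_response(k, n, a):
    """H(z) = (z^-1 - a) / (1 - a z^-1) at DFT bin k; |H| = 1."""
    z_inv = cmath.exp(-2j * math.pi * k / n)
    return (z_inv - a) / (1.0 - a * z_inv)


N = 64
A = 0.9  # assumed all-pass coefficient

# s(n): two harmonically unrelated sinusoids, exactly periodic in N.
s = [math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N)
     for t in range(N)]

S = dft(s)
S_prime = [S[k] * allpass_response(k, N, A) for k in range(N)]
s_prime = idft_real(S_prime)

mse = sum((u - v) ** 2 for u, v in zip(s, s_prime)) / N
power = sum(u * u for u in s) / N
max_magnitude_diff = max(abs(abs(p) - abs(q)) for p, q in zip(S, S_prime))
```

Here the mean squared difference exceeds the power of the signal
itself, while the two magnitude spectra agree to within rounding
error. A waveform-based error would therefore rate s'(n) as badly
distorted even though only its phase has changed.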

The answer is clearly to go to the spectrum. In fact, vocoders
have traditionally transmitted parameters related to the
magnitude of the spectrum. Channel vocoders have used one type of
phase realization for synthesis, and LPC vocoders have used
another (minimum phase). The problem, then, seems to reduce to a
comparison between the short-time spectra of s'(n) and s(n). But
the spectrum is only one aspect of the signal that is important
to perception and is distorted by the vocoder. The other
important aspect is the source information.

After some thought it became clear to us that objective measures
for the evaluation of vocoded speech quality must obey two
maxims:

(1) They must be a function of the vocoding process, and in
    particular the vocoder transmission parameters.

(2) They must somehow relate to perception.

The first maxim basically says that the objective evaluation of
vocoded speech quality cannot be done abstractly, treating s'(n)
as some arbitrary distortion of s(n), but rather it must relate
directly to the vocoding process. The second maxim merely states
the obvious necessity to have the objective measures be
perceptually meaningful. These two maxims not only form a sound
basis on which to build these measures, but also offer the hope
of a diagnostic tool for the evaluation and refinement of vocoder
design. Based on the two maxims, therefore, we proceeded to
develop the general framework for objective quality evaluation.

3. Determiners of Quality

In a vocoder system, there are four major identifiable components
that can contribute to the degradation of vocoded speech quality:
analysis, encoding, transmission, and synthesis. We shall discuss
the types of errors introduced by the different components, in an
effort to identify the major determiners of speech quality in the
vocoding process. This would then give us a handle with which to
design our objective measures.

Transmission 

Channel transmission errors are an important factor in the choice
of a vocoder system, in that different vocoders are affected
differently by different types of channel errors. However, given
that error correcting codes can reduce sharply the effective
error rate, one must still explain the degradation in quality due
to the vocoder itself. Therefore, in attempting to develop
objective quality measures, we shall assume that channel
transmission errors are negligible.

Analysis 

The importance of the analysis component is apparent when we
consider that it embodies the particular speech model employed.
The parameters extracted in this component determine the upper
bound on the quality of the synthesized speech.

The general vocoder speech model is that of a source exciting a
system that represents the short-time spectrum. We shall restrict
our discussion here to LPC vocoders, with the knowledge that it
can be extended easily to other types of vocoders (e.g., channel
vocoders). The LPC model is that of a source with a relatively
flat spectral envelope, exciting an all-pole filter. There are
three main types of LPC vocoders, depending on the type of source
excitation: residual excited, voice excited, and pitch excited.
However, all three types of vocoders perform essentially the same
type of analysis to obtain the filter parameters. Although there
may be speech quality differences depending, for example, on
whether the covariance, autocorrelation or lattice method of
linear prediction is used, these differences tend not to be of a
major nature. The upper bound on the vocoded speech quality is
basically a function of the type of excitation used. This is
discussed below for each of the three types of LPC vocoders.

Residual Excited Vocoder. In this type of vocoder the residual
signal is used to excite a filter that is the exact inverse of
the filter used to generate the residual signal from the speech
signal. Assuming no quantization errors in either the residual
signal or the filter parameters, the synthesized signal s'(n)
will be almost identical to the original signal s(n). Therefore,
here, the analysis itself does not degrade the speech quality.

Voice Excited Vocoder. In this type of vocoder [5,6], a
down-sampled baseband comprises the source information. At the
receiver the baseband is nonlinearly processed to obtain an
excitation function with a flat spectrum. Even under no parameter
quantization, the synthesized signal s'(n) will be different from
s(n). Therefore, the speech model employed is already responsible
for a certain change in the speech quality when compared to the
original. One method of estimating this change in quality would
be to compare the filter excitation signal for this vocoder to
the residual signal used in the residual excited vocoder. Such
comparison is probably not straightforward, but it is made easier
by the fact that the two signals are more or less
time-synchronized (in terms of where pitch periods are, etc.).

Pitch Excited Vocoder. In this case, the excitation is either a
sequence of pitch pulses or white noise. Here, s'(n) resembles
s(n) in its gross features, but certainly not in the detailed
signal structure. Also, unlike the voice excited case, s'(n) is
generally not synchronized with s(n), because the voiced/unvoiced
(V/UV) excitation is not synchronized with the residual signal,
which makes it difficult to get an objective estimate of the
change in quality due to the pitch excited model. This is
unfortunate considering that the V/UV decision is perhaps the
single most important one that affects the quality of s'(n).
There are currently no established procedures for the automatic
evaluation of V/UV decisions. The existing procedures are manual,
in that intervention by a human is necessary to establish whether
a voiced or an unvoiced decision would be appropriate for each
frame in the analysis (and whether the extracted pitch value is
accurate). In certain critical situations, such decisions are
made by trial and error as to which sounds better. There are
other cases where a mixed voiced-frication source is more
appropriate. Thus far, these cases have not been dealt with
successfully in vocoders.

Because of the dearth of good testing procedures to evaluate the
effects of the excitation on speech quality, we have decided to
table this problem in our initial search for objective measures
of quality.

Synthesis

Although a large part of the synthesis process is dictated by the
type of model used and signal analysis performed, there remain a
number of design choices in the synthesizer that can noticeably
affect the synthesized speech quality. The major choices relate
to the updating and interpolation of filter parameters, as well
as the choice of the filter implementation structure. For
example, we have found that if the analysis is performed
time-synchronously, it is best to interpolate and update filter
parameters time-synchronously as well [7].

Although there are important issues relating to filter
implementation structure (for example, placing the gain at the
output of a normalized filter [8] causes "clicks" to occur during
large changes in gain), it is always possible to choose the
implementation structure in such a way that the structure itself
contributes negligibly to the degradation of the quality.

Encoding

We include in this component

(1) the choice of the number of parameters to transmit,

(2) how to quantize them, and

(3) when to transmit them.

The parameters include the source parameters (the residual signal
in a residual excited vocoder, or pitch and gain in a pitch
excited vocoder), and the synthesizer parameters, which can take
different forms, with the most popular being the log area ratios
in an LPC vocoder [9], or the output energies of the channel
filters in a channel vocoder. The choice of the number of
parameters, along with their quantization, determines to a large
extent the static signal quality at specific time instances,
while the transmission and update rate determine the dynamic
signal fidelity.
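As an illustration of step (2), the sketch below quantizes a
single log area ratio uniformly. It rests on several assumptions
that are ours and not the text's: the log area ratio is taken as
20 log10[(1 + k)/(1 - k)] dB for a reflection coefficient k (both
the dB scaling and the sign convention vary between authors), the
quantizer is an unbounded uniform rounding quantizer, and all
function names are hypothetical.

```python
import math


def lar_db(k):
    """Log area ratio in dB for reflection coefficient k, |k| < 1.

    The 20*log10 scaling and the (1 + k)/(1 - k) orientation are
    assumed conventions.
    """
    if not -1.0 < k < 1.0:
        raise ValueError("reflection coefficient must satisfy |k| < 1")
    return 20.0 * math.log10((1.0 + k) / (1.0 - k))


def quantize(value, step):
    """Uniform quantization to the nearest multiple of step."""
    return step * round(value / step)


def reflection_from_lar_db(g):
    """Inverse of lar_db; any finite g maps back to |k| < 1."""
    ratio = 10.0 ** (g / 20.0)
    return (ratio - 1.0) / (ratio + 1.0)


k = 0.5
fine = quantize(lar_db(k), 0.5)     # fine step
coarse = quantize(lar_db(k), 2.0)   # coarse step
k_fine = reflection_from_lar_db(fine)
k_coarse = reflection_from_lar_db(coarse)
```

Because the inverse mapping always returns a value of magnitude
less than one, the decoded all-pole filter remains stable however
coarse the step, which is one reason this parameter set is
convenient to quantize. The coarser step simply moves the decoded
coefficient further from its original value.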

Conclusion

For narrow-band vocoder systems (less than 5000 bps), the
encoder, as we have defined it, is the major determiner of speech
quality. This is due to the heavy quantization that is necessary
to produce low bit rates. Design issues in the analysis and
synthesis are important, but for low rate systems, the encoder
plays the major role.

4. General Framework

It follows from the previous section that, if the bulk of the
synthesized speech quality is determined by the encoder, then one
should be able to obtain an approximate objective measure of the
quality difference between the original and vocoded speech by
somehow comparing the parameter values at the input and output of
the encoder. One could also include the interpolation in the
synthesis component, and compare the parameter values at the
synthesizer with the parameters at the input to the encoder
(which are produced by the analysis). In any case, the problem is
thus reduced from comparing the quality of two speech signals to
comparing two sets of parameters that are related to each other
in a well specified manner. This, in turn, implies that such
comparisons or quality measuring procedures are to be built
"inside" the vocoder instead of outside it. Comparisons are made
between the unquantized parameters (reference system) and the
parameter values used at the synthesizer (test system).

Inherent in the above analysis is that speech synthesized using
the input parameters to the encoder is of very good quality. This
is not difficult to achieve. For example, in an LPC vocoder, if
the signal bandwidth is 5 kHz, then a 14-pole analysis every
10 ms would give unquantized parameters, which when used in the
synthesis, would result in speech whose quality is very good
compared to the original speech. This does not necessarily mean
that the encoder has to quantize the 14 filter parameters and
transmit them every 10 ms. The restriction is merely on the
analysis. The encoder may then choose a smaller (and perhaps
variable) order for transmission, and at a lower (and perhaps
variable) rate [7].

We now state the three observations (assumptions) that form the
basis for our work in developing objective quality evaluation
measures:

(1) Speech synthesized from unquantized parameters, extracted
    every 10 ms, is of very good quality compared to the original
    speech.

(2) Except for pitch and gain, the fidelity of the short-time
    spectrum is the principal determiner of quality.

(3) The spectrum is uniquely defined by the filter parameters.

The first observation gives us an anchor point defined in terms
of the system parameters and against which to compare quantized
realizations of the same utterance. The second and third
observations relate the filter parameters to speech quality
through the concept of spectral fidelity. This, then, gives us a
framework within which to develop the desired objective measures
of speech quality.

5. An Initial Experiment

Given a speech utterance processed by an LPC vocoder, an
objective measure summarizes the error or deviation between the
reference and the test sets of parameters in terms of a single
number which we shall call an objective evaluation score. The
objective score would be expected to reflect the perceived
quality (relative to the reference) of the speech utterance if,
indeed, the objective measures were sensitive to all
quality-determining factors. It is unreasonable, and perhaps too
simplistic, to expect that one objective measure could always
correctly predict perceived speech quality. The chance of such a
prediction may be enhanced by combining a number of objective
measures in some fashion to obtain an overall objective score.
Each measure may be sensitive to some aspects of quality.
Ultimately, we plan to perform a multidimensional analysis [11]
on the objective scores obtained from a number of measures with
the hope of relating them to different quality dimensions yet to
be discovered. For the present study, however, we chose to
develop a number of objective measures and investigate each of
them separately so as to become familiar with their properties.

For each data frame, an error between the reference and the test
parameters is computed using an appropriate "distance" measure.
Ideally, such frame errors should be computed only at selected
points in time within the speech utterance that are "perceptually
significant." For the purposes of the present study, we simply
computed the frame error at a fixed rate, say, every 10 ms. We
thus had two problems: (1) to develop suitable distance measures
to compute frame errors; and (2) to combine all the frame errors
within a speech utterance into one number, which provides the
objective score.

We considered several distance measures
for computing the error between the reference
and test parameters of a given frame. These
measures were based on the power spectrum of
the all-pole linear prediction filter.
Traditional mean squared differences between
log spectra, as well as other measures, were
used. The errors were also frequency weighted
in different ways, including a weighting based
on the articulation index [10]. The resulting
sequence of frame errors was then combined
to give the overall objective score. The
sequence was time weighted using the filter
gain and the "spectral difference" (rate of
change of spectrum) between frames.
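
As an illustration of the kind of frame error described above, the
following sketch evaluates the all-pole power spectrum from a set of
predictor coefficients and computes the root mean squared difference
between two log spectra, in decibels. It is a minimal example under
stated assumptions, not the procedure used in the experiment: the
function names, the number of frequency points, and the use of an FFT
to sample the spectrum are all choices made here for illustration.

```python
import numpy as np

def lp_power_spectrum(a, gain=1.0, n_freq=256):
    """Sample G^2 / |1 + sum_k a_k exp(-j k w)|^2 at n_freq points on [0, pi].

    `a` holds the predictor coefficients a_1..a_p (illustrative convention).
    """
    inverse_filter = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    n_fft = 2 * (n_freq - 1)          # rfft then returns exactly n_freq bins
    A = np.fft.rfft(inverse_filter, n=n_fft)
    return gain ** 2 / np.abs(A) ** 2

def rms_log_spectral_error_db(p_ref, p_test):
    """Root mean squared difference between two log power spectra, in dB."""
    diff_db = 10.0 * np.log10(p_ref) - 10.0 * np.log10(p_test)
    return float(np.sqrt(np.mean(diff_db ** 2)))

# Example: the same one-pole filter with two different gains.
p_ref = lp_power_spectrum([-0.9], gain=1.0)
p_test = lp_power_spectrum([-0.9], gain=2.0)
frame_error = rms_log_spectral_error_db(p_ref, p_test)
```

Because the two example spectra differ only by a gain factor of 4 in
power, the log spectral difference is the same at every frequency, so
the frame error equals that constant offset.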

These objective measures were used in an
initial experiment to correlate the objective
scores with the results of a rank ordering
experiment of subjective quality that compared
different vocoder systems [11]. Different
combinations of objective measures were used
in the experiment. Comparisons of the
objective and subjective scores indicated that
no single objective measure was able to always
predict correctly the subjective rank ordering
of vocoded speech utterances. Furthermore,
the objective scores were heavily clustered
(relative to the subjective scores) for the
different systems, indicating a lack of
separability.

6.  Program  Research 

Based on our initial experiments it
became clear that what we need is a
step-by-step program to understand the
different aspects of the relations between
spectral variations and speech quality, in
order to be able to begin developing the
desired objective measures of quality. First,
we shall attempt to discover the quality
determining factors in the spectrum
independent of time. Following that, we shall
attack the more difficult problem of
discovering the time-dependent quality
determining factors.

As a first step, we have begun to develop
spectral distance measures that are consistent
with published perceptual data on vowel
difference limens. This work is described in
a separate paper [10]. One of the important
conclusions there is that traditional distance
measures between log spectra are not
consistent with perceptual data.

7.  Conclusions 

In this paper we presented the rationale
behind the general framework for the objective
evaluation of vocoder speech quality. The
framework calls for inserting these objective
measures inside the vocoder to compare the
sets of filter parameters after analysis and
before synthesis, in order to observe the
effects of encoding and interpolation on the
resulting spectra. Spectral variations, in
turn, are related to speech quality. A
step-by-step program has been initiated to
discover the time-independent as well as
time-dependent quality determining factors in
the short-time spectrum.


This paper considers distance measures
for determining the deviation between two
smoothed short-time speech spectra. Since
such distance measures are employed in speech
processing applications that either involve or
relate to human perceptual judgment, the
effectiveness of these measures will be
enhanced if they provide results consistent
with human speech perception. As a first
step, we suggest Flanagan's results on
difference limens for formant frequencies as
one basis for checking the perceptual
consistency of a measure. A general necessary
condition for perceptual consistency is
derived for a class of spectral distance
measures. A class of perceptually consistent
measures obtained through experimental
investigations is then described, and results
obtained using one such measure under
Flanagan's test conditions are presented.


One of his results is particularly relevant to
this paper. Briefly, when two formants are in
close proximity, human perception exhibits an
asymmetrical pattern in that moving one of the
two formants closer to the other by a given
amount produces a larger perceived quality
difference than moving that formant away from
the other by the same amount. On the other
hand, the same formant shifts produce a
symmetrical pattern when the two formants are
well separated. We use this result as one
basis for checking the perceptual consistency
of spectral distance measures.


Smoothed spectra can be obtained by using
a number of methods such as filter bank,
cepstrum, and linear prediction (LP). For
simplicity, we focus in this paper on LP
spectra, although most of the discussions
presented below apply to other types of
spectra as well. The LP spectrum is given by
[11]

$$P(\omega) = \frac{G^2}{S(\omega)} = \frac{R_0 V_p}{\left|1 + \sum_{k=1}^{p} a_k e^{-jk\omega}\right|^2} \qquad (1)$$


Given two smoothed short-time speech
spectra, a fundamental problem in speech
processing is to determine the distance or the
amount of deviation between the two spectra.
In speech recognition, the two spectra may
correspond to two different speech sounds, or
perhaps two different versions of the same
sound [1-3]. In speaker verification or
identification, the two spectra may correspond
to speech produced by either two different
speakers or by the same speaker on two
different occasions [4,5]. In variable frame
rate speech compression, two adjacent analysis
frames may have produced the two spectra
[6,7]. In the problem of objective evaluation
of vocoded speech quality, which the authors
have recently formulated [8], the two spectra
may correspond to the quantized and the
unquantized sets of filter parameters. Still
another application of spectral distance
measures is in the spectral sensitivity
analysis needed for optimal parameter
quantization [9].


where $G$ is the linear predictor gain, $R_0$ is
the speech signal energy, $V_p$ is the normalized
prediction error, $S(\omega)$ is the spectrum of the
inverse filter, and $a_k$, $1 \le k \le p$, are the
predictor coefficients.


Let $d(X,Y)$ denote the distance or
deviation between the spectra $X(\omega)$ and $Y(\omega)$.
From a mathematical viewpoint, one may be
tempted to insist that the distance measure
satisfy the three axioms of a metric:

(a) Positive definiteness:
    $d(X,Y) = 0$ iff $X = Y$;

(b) Symmetry: $d(X,Y) = d(Y,X)$;

(c) Triangle inequality:
    $d(X,Y) \le d(X,Z) + d(Z,Y)$.


These examples clearly bring out the
importance of spectral distance measures in
speech processing. The extent to which a
distance measure is valid greatly determines
the efficiency of the underlying task in which
it is employed. Inasmuch as one strives to
achieve a machine performance that is close to
what a human can do under the same situation
(e.g., first two applications above), or
inasmuch as the vocoded speech is to be
perceived by human listeners, it is
appropriate to require of these distance
measures to be at least consistent with the
known results of human perception. The work
reported in this paper represents a first step
towards obtaining perceptually consistent
measures of spectral distance.


We require, however, only the property (a) to
be true. There are many examples in real life
where distance symmetry does not hold. There
is no evidence to support the validity of a
symmetrical distance in the context of human
speech perception. For a similar reason, we
do not insist that the property (c) be
necessarily true. We postulate that if a
distance measure is perceptually consistent,
it will prove to perform better in
applications involving, or relating to, human
perception.


About two decades ago, Flanagan reported
perceptual results for determining difference
limens for formant frequencies of vowels [10].

Before we define a measure of distance
between two LP spectra $P_1(\omega)$ and $P_2(\omega)$, it may
be desirable to normalize these spectra in
some fashion. For instance, they may be
normalized to have the same arithmetic mean
(AM) or total energy. Alternately, they may
be normalized to have the same geometric mean
(GM), i.e., the log spectra will have the same
average.
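
The two normalizations can be sketched as follows for a spectrum
sampled on a uniform frequency grid. This is an illustrative sketch
only: the function names are hypothetical, and the integrals over
frequency are approximated by simple averages over the samples.

```python
import numpy as np

def normalize_am(p):
    """Scale a sampled power spectrum to unit arithmetic mean."""
    p = np.asarray(p, dtype=float)
    return p / p.mean()

def normalize_gm(p):
    """Scale a sampled power spectrum to unit geometric mean,
    i.e. make the average of its log spectrum equal to zero."""
    p = np.asarray(p, dtype=float)
    return p / np.exp(np.mean(np.log(p)))

# Example on a toy "spectrum" (values chosen only for illustration).
p = np.array([1.0, 2.0, 4.0, 8.0])
p_am = normalize_am(p)   # arithmetic mean becomes 1
p_gm = normalize_gm(p)   # mean of log spectrum becomes 0
```

After GM normalization, two spectra that differ only by a gain factor
become identical, which is why the log spectra "have the same average."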

Error  Definition 

An error function between the normalized
spectra can be defined either in the (linear)
spectral domain as

$$e(\omega) = P_1(\omega) - P_2(\omega), \qquad (2)$$

or, in the log spectral domain as

$$e(\omega) = \log P_1(\omega) - \log P_2(\omega). \qquad (3)$$

Other reasonable error definitions include

$$e(\omega) = [P_1(\omega) - P_2(\omega)] / P_1(\omega), \qquad (4)$$

$$e(\omega) = P_1(\omega) / P_2(\omega). \qquad (5)$$


deviation. (Frequencies of the other three
fixed formants and the nominal value of the
second formant frequency are given in the
figure. Fixed bandwidths of all the formants
are as in [10].) Fig. 1(a) corresponds to the
error definition (5) while Fig. 1(b)
corresponds to the error definition (3). Both
plots were obtained using GM normalization,
k=1 and no weighting in (6). (We have plotted
log d for plot (b) so that ordinates of both
plots are in decibels.) The almost symmetrical
plots in Fig. 1 do not conform with properties
given by Flanagan (see Fig. i((c) in [10]).

A large class of spectral distance measures
can be defined as the weighted $L_k$ norm:

$$d_k(P_1, P_2, W) = \left[\frac{\int W(P_1(\omega), P_2(\omega), \omega)\, |e(\omega)|^k\, d\omega}{\int W(P_1(\omega), P_2(\omega), \omega)\, d\omega}\right]^{1/k} \qquad (6)$$

where the weighting function $W$ in general
depends on $P_1(\omega)$, $P_2(\omega)$ and frequency $\omega$, and
takes only positive values. If the error is
defined as in (4) or (5), the distance measure
in (6) is not symmetric. Also, if the
weighting function depends explicitly on $P_1$
and $P_2$, the resulting distance measure is in
general not symmetric. In all other cases, a
symmetric distance measure results. In the
absence of any weighting, $d_{-1}$ is the harmonic
mean, $d_0$ is the GM, $d_1$ is the AM, and $d_2$ is
the root mean square value of the absolute
error function. Between the minimum
$d_{-\infty} = \min |e(\omega)|$ and the maximum $d_{\infty} = \max |e(\omega)|$,
$d_k$ is a monotonically increasing function of
k.
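
The family in (6) can be sketched numerically as a weighted power mean
of the absolute error over frequency samples. The code below is a
minimal illustration under stated assumptions: the function name is
hypothetical, the integrals are replaced by sums over a uniform grid,
and the cases k = 0 and k = ±∞ are handled as the usual limits of the
power mean (geometric mean, minimum, and maximum).

```python
import numpy as np

def spectral_distance(e, k, w=None):
    """Weighted L_k norm of the error samples `e`, in the sense of (6).

    `w` holds positive weights (uniform if omitted). For k <= 0 the
    error samples must be nonzero.
    """
    e = np.abs(np.asarray(e, dtype=float))
    w = np.ones_like(e) if w is None else np.asarray(w, dtype=float)
    w = w / w.sum()                      # constant factors in W cancel
    if k == np.inf:
        return float(e.max())
    if k == -np.inf:
        return float(e.min())
    if k == 0:                           # limit k -> 0: geometric mean
        return float(np.exp(np.sum(w * np.log(e))))
    return float(np.sum(w * e ** k) ** (1.0 / k))

# Example: the distance grows monotonically with k.
e = [1.0, 2.0, 4.0]
values = [spectral_distance(e, k) for k in (-np.inf, -1, 0, 1, 2, np.inf)]
```

For these error samples the unweighted values run from the minimum (1)
through the harmonic, geometric, and arithmetic means and the root mean
square value up to the maximum (4), in increasing order.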


The  weighting  function  W in  (6)  is  used 
to  differentially  weight  individual  errors  and 
is  determined  based  on  some  concept  of  speech 
perception.  Notice  that  any  constant 
multiplicative  factor  in  the  weighting 
function  does  not  affect  the  distance  measure. 
Some  specific  weighting  functions  are 
discussed  in  Section  4. 

Examples: References [2,6,7] use $d_1$ with the
error defined as in (5). (With a Gaussian
assumption, this measure becomes a likelihood
ratio [2].) Reference [9] employs $d_1$ with the
error given by (3). Cepstral distance
measures used in [1,9] have been shown to be
highly correlated to $d_2$ with the error as in
(3) [12].

3. A Necessary Condition for Perceptual
Consistency

Fig.  1 shows  two  plots  of  spectral 
deviation  or  distance  versus  frequency  shift 
of  the  second  formant  causing  that  spectral 


Notice that the two distance measures
that produced the plots in Fig. 1 depend only
on the ratio of the spectra $P_1$ and $P_2$ (in view
of (3) and (5)). Below we prove that with GM
normalization, any distance measure which is a
function of only the ratio of the spectra is
necessarily perceptually inconsistent. First,
we give our working definition of perceptual
consistency, based on Flanagan's results [10].

Working Definition of Perceptual Consistency

Let X and Y be two vowel spectra, such that Y
is identical to X except that one of the
formant frequencies F is shifted by a variable
amount ΔF. A given spectral distance measure
d(X,Y) between X and Y is said to be
perceptually consistent if

(a) when F is close to another formant F',
    d(X,Y) exhibits asymmetry such that it
    is greater when F is moved ΔF towards
    F' than when F is moved ΔF away from
    F';

(b) such asymmetry decreases as F and F'
    are further apart.

Now, consider a class D of spectral
distance measures defined by (6) where the
error $e(\omega)$ is computed after GM normalization
of the spectra. For this class of distance
measures, a necessary condition for perceptual
consistency is provided below in the form of a
theorem.

Theorem: A necessary condition for any
spectral distance measure $d(P_1, P_2)$ in the
class D to be perceptually consistent (as
defined above) is that it not be a function of
only the ratio of the two spectra $P_1$ and $P_2$.

Proof: Assume that a distance measure in D
violates the necessary condition. We show
that this distance measure is not perceptually
consistent. Let $P_2$ be obtained from $P_1$ by
shifting only one of its formant frequencies
while keeping all other parameters intact.
Let the denominator $S(\omega)$ in (1) be factored
into $R(\omega)$ and $S'(\omega)$, where $R(\omega)$ is the
contribution to the spectrum from the formant
under consideration and $S'(\omega)$ represents the
contributions from all other poles of the
linear predictor. Since GM normalization
removes the gain term, the normalized spectra
are $P_1(\omega) = 1/(R_1(\omega) S'(\omega))$ and
$P_2(\omega) = 1/(R_2(\omega) S'(\omega))$,
where $R_2(\omega)$ is the perturbed version of $R_1(\omega)$.
This gives the result that the ratio of $P_1$ and
$P_2$ depends only on the formant under
consideration. Specifically, the ratio does
not depend on whether or not this formant is
in close proximity to another formant. This
clearly establishes that the measure is not
perceptually consistent according to our
working definition.
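
The argument of the proof can be checked numerically with a small
sketch. The code below builds GM-normalized all-pole spectra from
formant resonances and compares the ratio $P_1/P_2$ for the same formant
shift with a neighboring formant placed nearby or far away. Everything
specific here is an assumption made for illustration: the sampling
rate, the formant frequencies and bandwidths, the frequency grid, and
the function names are not taken from the paper.

```python
import numpy as np

def formant_section(freq_hz, bw_hz, fs_hz, w):
    """|R(w)|^2 for one complex-conjugate pole pair (one formant)."""
    r = np.exp(-np.pi * bw_hz / fs_hz)          # pole radius from bandwidth
    theta = 2.0 * np.pi * freq_hz / fs_hz       # pole angle from frequency
    z_inv = np.exp(-1j * w)
    return np.abs(1.0 - 2.0 * r * np.cos(theta) * z_inv + (r * z_inv) ** 2) ** 2

def gm_normalized_spectrum(formants_hz, w, bw_hz=100.0, fs_hz=10000.0):
    """All-pole spectrum 1/S(w), scaled to unit geometric mean."""
    s = np.ones_like(w)
    for f in formants_hz:
        s = s * formant_section(f, bw_hz, fs_hz, w)
    p = 1.0 / s
    return p / np.exp(np.mean(np.log(p)))

def spectral_ratio(f_nominal, f_shifted, f_neighbor, w):
    """Ratio P1/P2 when one formant is shifted and a neighbor is fixed."""
    p1 = gm_normalized_spectrum([f_nominal, f_neighbor], w)
    p2 = gm_normalized_spectrum([f_shifted, f_neighbor], w)
    return p1 / p2

w = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
ratio_near = spectral_ratio(1500.0, 1550.0, 1700.0, w)  # neighbor close by
ratio_far = spectral_ratio(1500.0, 1550.0, 3500.0, w)   # neighbor far away
same_ratio = bool(np.allclose(ratio_near, ratio_far, rtol=1e-9))
```

In this sketch the two ratios agree to within rounding error, so any
distance computed from the ratio alone gives the same value whether or
not the neighboring formant is close, as the proof asserts.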


With other types of spectral
normalization, the ratio of gain terms ($G^2$ in
(1)) of the two spectra depends in general on
the overall shape of the spectrum. For
instance, with AM normalization, this ratio is
between the normalized prediction errors ($V_p$
in (1)) corresponding to the two spectra,
which depend on the total spectral shapes [3].
Establishing a general necessary condition for
perceptual consistency in these cases is
difficult. However, with AM normalization,
our experimental results show that when the
necessary condition stated above is violated,
perceptual consistency is not obtained.

We do not wish to state that perceptually
inconsistent measures are not useful. In
fact, in the applications mentioned in the
Introduction, many such measures have been
successfully used. We suggest, however, that
use of perceptually consistent measures in
these applications may lead to an improved
performance of the underlying tasks.

4. Weighting Functions

We have investigated a number of
reasonable frequency weighting functions [13].
A brief discussion of some of these weighting
functions is given below.

Spectral Intensity Weighting

Since formant peaks of a spectrum are
perceptually important, it is reasonable to
emphasize spectral errors that occur close to
formant peaks. One way of achieving this
error weighting is to use $P_1(\omega)$, $P_2(\omega)$, or some
generalized mean of the two as weighting
functions.

Frequency Derivative Weighting

An alternate method of emphasizing
spectral errors that occur close to formant
peaks is to employ a suitable function of
first and second derivatives of $P_1(\omega)$ or $P_2(\omega)$
for weighting the errors.

Articulation-Index (AI) Based Weighting

AI is a physical measure that is highly
correlated with subjective speech
intelligibility results. Since it is not
unrealistic to consider speech intelligibility
and quality as related phenomena, we have
derived, by adapting some of the results used
in AI computation, a weighting function which
decreases exponentially with frequency:
$W = \exp(-\alpha\omega)$, where $\alpha$ is a particular constant
[13].

All the spectral distance measures that
we investigated, even with the use of the
above weighting functions, had one common
problem in that for the case when the first
formant frequency was shifted about the
nominal value of 300 Hz, a given amount of
left shift always produced a larger spectral
deviation than a right shift of the same
amount, which is just the opposite of what
Flanagan reported (see Fig. 3(a) in [10]).
(We found, however, that some of these
measures and weighting functions produced the
right types of asymmetry in other test
conditions considered by Flanagan.) To attempt
to overcome this problem, we investigated the
following weighting functions based on
perceived loudness.

Perceived Loudness Weighting

Based on the work of Stevens [14], we
define the perceived loudness function $L(\omega)$ of
a spectrum $P(\omega)$ as $[P(\omega)A(\omega)]$ raised to a
power, where $A(\omega)$
is shown plotted in Fig. 2. Notice the sharp
change of $A(\omega)$ at low frequencies, which may
be used to our advantage to overcome the
problem mentioned above. The weighting
function may then be defined in terms of $A(\omega)$
or $L(\omega)$. We have investigated the following
weighting functions: $W = A(\omega)$; $W = L_1(\omega)$
(perceived loudness of $P_1(\omega)$); $W = L_2(\omega)$;
$W = |L_1(\omega) - L_2(\omega)|$. Only the weighting function
$W = A(\omega)$ produced the right asymmetry for the
case when the first formant frequency was
shifted about its nominal value of 300 Hz.

In the next section, we give examples of
perceptually consistent distance measures
which use the weighting function $A(\omega)$.

5. A Class of Perceptually Consistent
Distance Measures

Our experimental investigations have led
to a class of spectral distance measures which
produce the right types of asymmetry
attributable to formant interaction under all
test conditions considered by Flanagan. This
class is defined by (6) with GM normalization,
the spectral error defined in the (linear)
spectral domain as in (2), and the weighting
function $A(\omega)$ shown in Fig. 2.

Figs. 3-5 show plots of spectral distance
versus formant frequency shift under three
different test conditions for the above
measure with k=1 in (6). These plots compare
rather nicely to the corresponding ones that
Flanagan has given. Notice that while our
spectral distance plots in general have a
monotonically increasing tendency, Flanagan's
plots reach a constant 100% for large formant
frequency shifts. This is because subjects
in his tests were asked merely to say if they
perceived the two speech sounds corresponding
to unperturbed and perturbed sets of formants
as being different, rather than to quantify the
amount of quality difference they perceived
between the two sounds.

The effectiveness of the weighting $A(\omega)$
is particularly apparent in the low frequency
region. Fig. 6 shows plots of spectral
distance with and without this weighting,
other conditions being the same, for the case
when the first formant is shifted about 300
Hz. The unweighted measure gives a slight
asymmetry but in the wrong sense, Fig. 6(a),
while the weighted measure produces the right
asymmetry as shown by Fig. 6(b).

A disadvantage of the distance measures
presented in this section is that they are
dependent on the energy of the spectra.
(Notice that distance measures which are
functions of only the ratio of spectra do not
suffer from this disadvantage.) With energy
dependent measures, comparison of spectral
distances obtained, for instance, for
different analysis situations can be
meaningfully done only after suitably scaling
the distance values. A reasonable condition
to impose on such scaling is that the spectral
distances corresponding to the formant
frequency difference limens at the different
frequencies be approximately equal. This will
be our next step in refining the class of
perceptually consistent spectral distance
measures that we suggested above.


We have reported preliminary results of
ongoing work on perceptually consistent
spectral distance measures. Our experience
has been that GM normalization works better
than AM normalization inasmuch as one is
looking for sensitivity to interaction of
formants. The results we have presented in
this paper show that the distance is best
defined in terms of the difference in the
(linear) spectral values. Besides continuing
our investigation reported here, we plan to
use the developed measures in several
applications.


ABSTRACT

Several methods are presented for the
objective speech quality evaluation of narrowband
LPC vocoders, based on a framework that we proposed
at the 1976 ICASSP conference. In each method, the
error in short-term spectral behavior between
vocoded speech and the original is computed once
every 10 ms. These errors are appropriately
weighted and averaged over an utterance to produce
a single objective score. Several short-term error
measures, and time-weighting and averaging
techniques are investigated. We evaluate the
objective methods by correlating the resulting
objective scores with formal subjective speech
quality judgments. High correlations obtained
indicate the usefulness of these methods.


1.  INTRODUCTION 

Quality assessment of vocoded speech is often
performed to determine the user acceptance of a
vocoder, or to compare the performance of competing
vocoder types, or to evaluate the different choices
of a given vocoder's design parameters. Procedures
used for speech quality measurement are either
subjective or objective, depending upon whether or
not they make use of subjective judgments from
human listeners. Subjective procedures require
extensive testing with human listeners, which is
expensive in terms of both time and money. On the
other hand, objective measures would enable
evaluation to be done by computer as well as ensure
uniformity in speech quality evaluation. Also,
objective measures can be incorporated into the
design of better quality vocoders. Of course, the
validity of any objective procedure must first be
established by comparing its results against
subjective judgments.

While there exist a few subjective procedures,
relatively little work has been done to develop
objective procedures. The purpose of this paper is
to report on several objective measures for speech
quality assessment of narrowband linear predictive
(LPC) vocoders. We have developed these measures
based on a general framework for objective speech
quality evaluation that we presented at the 1976
ICASSP conference. A specific objective procedure
presented also at that conference [3] falls within
this general framework.

2.  A FRAMEWORK  FOR  OBJECTIVE  SPEECH  QUALITY 
EVALUATION 

Any objective measure for the evaluation of
vocoded speech quality must be a function of the
vocoder transmission parameters and must somehow
relate to perception. Using this and a number of
other detailed arguments, we formulated in [1] a
general framework for objective speech quality
evaluation, based on the following reasonable
assumptions:

1) Speech synthesized from unquantized vocoder
parameters extracted every 10 ms is of very good
quality compared to the original speech.

2) Except for pitch and gain, the fidelity of
the short-time speech spectrum is the principal
determiner of quality.

3) The spectrum is uniquely defined by the
linear prediction filter parameters.

The  first  assumption  gives  us  an  anchor  point, 
defined  in  terms  of  the  unquantized  vocoder 
parameters,  against  which  to  compare  quantized 
realizations  of  the  same  utterance.  The  second  and 
third  assumptions  relate  the  filter  parameters  to 
speech  quality.  In  the  above  formulation,  we  have 
implicitly made the important assumption that the
vocoder  under  evaluation  extracts  pitch  values  and 
voiced/unvoiced  decisions  without  any  error. 
Although  the  formulation  may  be  extended  to  cover 
voice-excited  and  residual-excited  LPC  vocoders  as 
well  as  vocoders  other  than  LPC  (e.g.,  channel 
vocoders)  [1],  here  we  consider  its  application 
exclusively  to  pitch-excited  (narrowband)  LPC 
vocoders. 

In the above framework the problem of
objective quality evaluation is reduced to the
following two steps: 1) for each 10 ms frame,
compute an objective error as the distance or
deviation between the spectrum corresponding to the
unquantized LPC parameters and the spectrum
corresponding to the quantized and interpolated LPC
parameters (interpolation is required if the
vocoder's transmission frame rate is lower than 100
frames/sec); and 2) combine all the frame errors
thus computed within a speech utterance into one
number, which becomes the objective speech quality
score. Methods for carrying out the two tasks are
presented in Sections 3-5.

3.  FRAME  SPECTRAL  ERROR  MEASURES 

The power spectrum of the linear prediction
all-pole filter is given by

$$P(\omega) = \frac{G^2}{S(\omega)} = \frac{R_0 V_p}{\left|1 + \sum_{k=1}^{p} a_k e^{-jk\omega}\right|^2} \qquad (1)$$

where $G$ is the filter gain, $R_0$ is the speech signal
energy, $V_p$ is the normalized prediction error,
$S(\omega)$ is the spectrum of the inverse filter, $a_k$ are
the predictor coefficients, and $p$ is the order of
the linear predictor. To compute the objective frame
error, we require a measure of distance between the
reference spectrum $P_1(\omega)$ (unquantized parameters)
and the vocoded speech spectrum $P_2(\omega)$ (quantized
and interpolated parameters). Although there are a
number of distance measures available [2], we
consider in this paper three distance measures
denoted below as d1, d2 and d3.

For d1 and d2, the distance between the
spectra $P_1$ and $P_2$ is computed in three steps as
follows:

a) Normalize the two spectra by making them have
the same (unity) geometric mean (i.e., the areas
under the log spectra are made equal);

b) Determine the error at each frequency either
as the magnitude of the difference in linear
spectral amplitudes of the two normalized spectra
(d1) or as the magnitude of the difference in their
log spectral amplitudes (d2); and

c) Compute a suitable norm of this error
function.

Notice that the geometric mean of the power
spectrum in (1) is $V_p R_0$. The two measures d1 and
d2 are given below [2,4]:

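$$d1 = \frac{\int_0^{\pi} A(\omega)\, \left|S_1^{-1}(\omega) - S_2^{-1}(\omega)\right| d\omega}{\int_0^{\pi} A(\omega)\, S_1^{-1}(\omega)\, d\omega} \qquad (2)$$

$$d2 = \left[\frac{1}{\pi} \int_0^{\pi} \left|\log S_1(\omega) - \log S_2(\omega)\right|^2 d\omega\right]^{1/2} \qquad (3)$$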

where $A(\omega)$ is a smoothed version [4] of the
weighting function originally introduced by
S.S. Stevens for measuring the perceived loudness
[2]. The distance measure d1 is perceptually
consistent in the sense that it produces results
consistent with published subjective perceptual
results on formant frequency difference limens
[2,4], while d2 is not perceptually consistent.

The third distance measure d3 measures the
absolute deviation in the log area ratios (LARs)
$g_k$, which are uniquely related to the predictor
coefficients $a_k$, and which possess flat or uniform
spectral sensitivity as well as other desirable
properties [5]:

$$d3 = \sum_{k=1}^{p} \left|g_{1k} - g_{2k}\right| \qquad (4)$$

where the two sets of LARs correspond to the two
linear predictors. Since we deal with LPC vocoders
that employ LARs as transmission parameters, they
are readily available for evaluating d3 "inside"
the vocoder. Of the above three measures, clearly
d3 is least expensive to compute.
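
A minimal sketch of d3 is given below. It assumes that the LARs are
obtained from the reflection coefficients $k_i$ of each predictor as
$g_i = \log[(1 + k_i)/(1 - k_i)]$; this is one common convention (the
sign convention varies between authors), and the function names are
hypothetical.

```python
import numpy as np

def log_area_ratios(reflection_coeffs):
    """Log area ratios from reflection coefficients with |k_i| < 1.

    Assumed convention: g_i = log((1 + k_i) / (1 - k_i)).
    """
    k = np.asarray(reflection_coeffs, dtype=float)
    return np.log((1.0 + k) / (1.0 - k))

def d3(reflection_ref, reflection_test):
    """Sum of absolute LAR differences between two predictors, as in (4)."""
    g_ref = log_area_ratios(reflection_ref)
    g_test = log_area_ratios(reflection_test)
    return float(np.sum(np.abs(g_ref - g_test)))

# Example with a second-order predictor (values chosen for illustration).
frame_error = d3([0.5, -0.5], [0.0, 0.0])
```

Since the measure needs only p subtractions and additions once the
LARs are available, it is inexpensive compared with measures that
require evaluating the spectra.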

4. TIME-WEIGHTING OF FRAME ERRORS

The task of combining the frame errors E(i)
into a single speech quality score involves
weighting the frame errors with a suitable
time-weighting function W(i) to reflect the
relative importance of the individual frames to
perceived speech quality, and then averaging the
weighted frame errors E(i)W(i). Below, we describe
two time-weighting methods that we have
investigated.

Energy Weighting

In this method, we make the reasonable
assumption that frame errors in low energy regions
of an utterance have a smaller influence on quality
judgments than those in high energy regions. For
example, large changes in the spectrum may not be
detected by the listener if the total energy in the
spectrum is low. We considered the weighting as a
function of the frame speech signal energy per
sample, computed over an interval of M ms and
expressed in decibels. We have investigated linear
(W1) and piecewise-linear (W2) weighting functions.
These are depicted in Fig. 1. The piecewise-linear
function shown is less drastic than the linear
function in that it deemphasizes frame errors in
the low energy region, but has only a slight
influence on all other frame errors.
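A small sketch of the two weighting shapes may help fix ideas. The exact curves are those of Fig. 1, which is not reproduced here, so every breakpoint and default value below (`e_min`, `e_knee`, `e_max`, `w_knee`) is a hypothetical placeholder, as are the function names. Only the general shapes follow the description: W1 grows linearly with frame energy in dB, while W2 rises steeply through the low energy region and is nearly flat elsewhere.

```python
import numpy as np

def frame_energy_db(frame, floor=1e-10):
    """Energy per sample of one frame, in decibels."""
    x = np.asarray(frame, dtype=float)
    return float(10.0 * np.log10(max(np.mean(x ** 2), floor)))

def w_linear(e_db, e_min=20.0, e_max=70.0):
    """W1-like weight: linear in dB, clipped to [0, 1] (placeholder limits)."""
    return float(np.clip((e_db - e_min) / (e_max - e_min), 0.0, 1.0))

def w_piecewise(e_db, e_min=20.0, e_knee=40.0, e_max=70.0, w_knee=0.9):
    """W2-like weight: steep below the knee, nearly flat above it."""
    # np.interp holds the end values outside [e_min, e_max].
    return float(np.interp(e_db, [e_min, e_knee, e_max], [0.0, w_knee, 1.0]))
```

With these placeholder values, a frame at 45 dB gets weight 0.5 under the linear function but more than 0.9 under the piecewise-linear one, which is the sense in which the latter is less drastic.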

Dynamic Fidelity Weighting

Another consideration relevant to speech
quality is the rate at which speech characteristics
change in time. This rate varies in time in
accordance with the sequence of speech sounds being
uttered. Since it is reasonable to assume that a
rate of LPC parameter extraction of 100 frames/sec
(fps) should be adequate to track all perceptually
important speech events, we chose the anchor system
as having 100 fps LPC data (Section 2). However,
in the case of slowly varying speech materials
(e.g., JB1, see Section 6), the actual rate of
change of speech characteristics is substantially
lower than that in normal speech. This means that
parameter extraction at rates much less than 100
fps can adequately represent the slowly varying
speech. This poses two problems with respect to
the choice of our anchor system. First, the
objective speech quality measure computed based on
the above anchor would generally yield lower error
when the transmission frame rate of the vocoder
under evaluation is closer to 100 fps. This is
because when the vocoder's frame rate is closer to
the anchor system's frame rate, frame error
computation involves fewer parameter errors due to
interpolation, which are being substituted by
quantization errors, and these are generally
smaller than interpolation errors. Therefore, for
slowly varying speech the objective measure would
overestimate the vocoded speech quality. Secondly,
subjective speech quality tests have clearly shown
that a 100 fps LPC vocoder produces inferior speech
quality (characterized as having a "wobble"
quality) when processing slowly varying speech,
compared to a 50 fps vocoder [6]. To overcome
these problems, we redefine the anchor system based
on a functional perceptual model (PM) of speech
that two of the authors have recently formulated
[6]. In this model, it is postulated that (1)
Speech can be represented in terms of LPC (or
other) parameters extracted at a minimal set of
perceptually significant frames, not necessarily
equally spaced, and (2) Between any two such
frames, the parameters vary linearly. An automatic
scheme has been developed to determine or "mark"
the location of the perceptually significant frames
[6]. The PM-based anchor system is therefore
characterized by the LPC parameters (actually,
LARs) of the frames marked by this scheme and the
linearly interpolated parameter values for the
unmarked frames. We have presented this
modification in this section, since it may be
viewed as an implicit time-weighting method. In
addition, we have investigated an explicit
time-weighting in which frame errors for the marked
frames are weighted with unity, while other frame
errors are weighted with a fraction depending on
the duration of the transmission interval to which
they belong. In the interest of keeping the
presentation of results in Section 6 brief, we
assume unity weighting for all marked and unmarked
frames.
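The construction of the PM-based anchor parameters can be sketched as below, assuming the marked frame indices are already available (the automatic marking scheme of [6] is not reproduced). The function name `pm_anchor` is hypothetical, and holding the parameters constant before the first and after the last marked frame is an assumption made only to keep the sketch total.

```python
import numpy as np

def pm_anchor(lars, marked):
    """Keep LARs at marked frames; linearly interpolate the unmarked ones.

    lars   : array of shape (num_frames, order), e.g. 100 fps analysis data
    marked : indices of the perceptually significant frames
    """
    lars = np.asarray(lars, dtype=float)
    marked = np.asarray(sorted(marked))
    frames = np.arange(lars.shape[0])
    anchor = np.empty_like(lars)
    for k in range(lars.shape[1]):
        # Piecewise-linear track through the marked frames only.
        # Assumption: np.interp's end-value hold is used outside the marks.
        anchor[:, k] = np.interp(frames, marked, lars[marked, k])
    return anchor
```

Frame errors are then computed against `anchor` rather than against the raw 100 fps parameters, so rapid frame-to-frame variation between marked frames no longer counts in the reference.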

5.  TIME-AVERAGE OF WEIGHTED FRAME ERRORS

The final step in formulating an objective
speech quality measure is to specify how the
weighted frame errors W(i)E(i) are combined into
one number. One obvious method is to use the
weighted r-th mean of all the (say, L) frame errors
over the whole utterance:

$$\left[\frac{\sum_{i=1}^{L} W(i)\,[E(i)]^{r}}{\sum_{i=1}^{L} W(i)}\right]^{1/r} \qquad (5)$$

The simplest average of this type is the arithmetic
mean with r=1; this average is denoted by E1. Two
other averages E3 and E4 that we have used are
described below.
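The weighted r-th mean of (5) translates directly into a few lines of code. The sketch below is illustrative only; the function name is hypothetical and unity weights are assumed when none are given.

```python
import numpy as np

def weighted_r_mean(errors, weights=None, r=1.0):
    """Weighted r-th mean of frame errors; r = 1 gives the average E1."""
    e = np.asarray(errors, dtype=float)
    w = np.ones_like(e) if weights is None else np.asarray(weights, dtype=float)
    return float((np.sum(w * e ** r) / np.sum(w)) ** (1.0 / r))
```

Raising r pulls the result toward the largest frame errors: for the errors 3 and 4, r = 1 gives 3.5 while r = 2 gives about 3.54.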

Subjective quality judgments of an utterance
are greatly influenced by the presence of even one
or two places of large errors such as those that
are perceived as pops or glitches. An overall
average such as E1 may not portray such influences,
especially if most of the remaining frame errors
are small. The above r-th mean with a large r
would emphasize large frame errors. An alternate
method of achieving this selectivity to large
errors is to consider the arithmetic mean over only
a specified number of the largest frame errors. We
define a measure E2 which is the average over the
top 10% of the frame errors. A two-term composite
average measure E3 is obtained as the sum of E1 and
E2. A different composite average is motivated by
the consideration that the number of large frame
errors which influence perceived quality of an
utterance may vary from one vocoded version to
another. This suggests that E2 may be replaced by
a selective average that is carried out over a
variable percentage of top frame errors, or
alternately E2 may be multiplied with a variable
weight and then added to E1. We denote this
weighted composite average by E4:

$$E4 = E1 + \gamma\,E2 \qquad (6)$$

In our experimental investigations, we obtained
significant improvements in correlation scores when
we chose the following exponential weight $\gamma$:

$$\gamma = k_1\,e^{k_2 \alpha_3} \qquad (7)$$

where $k_1$ and $k_2$ are constants, and $\alpha_3$ is the
"skewness" of the frame error distribution over the
whole utterance, defined by

$$\alpha_3 = \frac{1}{L}\sum_{i=1}^{L} \frac{[E(i) - E1]^{3}}{\sigma_E^{3}}, \qquad (8)$$

with $\sigma_E$ being the standard deviation. Use of $k_2 = -1$
was found to improve the performance of the
objective measures as determined by correlation
against subjective judgments.
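The averages E1 through E4 can be put side by side in a short sketch. It assumes unity time-weighting, takes the skewness as the third central moment divided by the cube of the standard deviation, and uses k2 = -1 as reported; the default k1 = 1 is a placeholder, since its value is not given here, and the function name is hypothetical.

```python
import numpy as np

def composite_scores(errors, k1=1.0, k2=-1.0, top_fraction=0.10):
    """Return E1, E2, E3, E4 plus the skewness and weight used for E4."""
    e = np.asarray(errors, dtype=float)
    e1 = e.mean()                                   # arithmetic mean
    n_top = max(1, int(round(top_fraction * e.size)))
    e2 = np.sort(e)[-n_top:].mean()                 # mean of the largest errors
    e3 = e1 + e2                                    # two-term composite
    sigma = e.std()
    # Skewness of the frame error distribution; zero if all errors are equal.
    skew = np.mean((e - e1) ** 3) / sigma ** 3 if sigma > 0 else 0.0
    gamma = k1 * np.exp(k2 * skew)                  # exponential weight
    e4 = e1 + gamma * e2                            # weighted composite
    return {"E1": float(e1), "E2": float(e2), "E3": float(e3),
            "E4": float(e4), "skewness": float(skew), "gamma": float(gamma)}
```

With k2 negative, a strongly right-skewed error distribution (a few isolated large errors) shrinks the weight on E2, whereas a symmetric distribution leaves the weight at k1, so that E4 equals E3 when k1 = 1.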

6.  CORRELATION AGAINST SUBJECTIVE JUDGMENTS

With three frame error measures d1-d3 (Section
3), two energy weighting functions W1 and W2
(Section 4), the perceptual-model-based dynamic
fidelity weighting scheme (Section 4), and three
time-average measures E1, E3, and E4 (Section 5),
and considering different values for the constants
that figure in some of the above items, we get a
large variety of possible objective speech quality
measures. To evaluate the effectiveness of a given
objective quality measure, we correlate the
objective scores that the measure produces for an
utterance processed through a range of LPC vocoder
systems against the corresponding subjective
quality judgments. Notice that the objective
scores for the various vocoded versions of an
utterance are all based on the same anchor, and
hence the scores can be directly compared to infer
quality differences between different vocoders in
processing that utterance.

We compute two types of correlation between
the objective and subjective data: (1) regular
correlation (or simply correlation); and (2) rank
order correlation. For the second type, two sets
of ranks are first assigned to vocoders under study
using separately objective and subjective data, and
then regular correlation is computed between the
two sets of ranks. Therefore, rank order
correlation is useful in examining how well an
objective measure would order vocoders in terms of
perceived (subjective) speech quality. On the
other hand, unlike the rank order correlation,
regular correlation is sensitive to the actual
extents of vocoder quality differences.
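Both correlation types can be sketched with NumPy alone. Regular correlation is taken here to be the Pearson product-moment coefficient, and tied scores are given the average of the ranks they span; the tie handling is an assumption, since the treatment of ties is not described. Function names are hypothetical.

```python
import numpy as np

def regular_correlation(x, y):
    """Pearson correlation between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="stable")
    r = np.empty(v.size)
    r[order] = np.arange(1, v.size + 1)
    for val in np.unique(v):
        tie = v == val
        r[tie] = r[tie].mean()
    return r

def rank_order_correlation(x, y):
    """Regular correlation computed between the two sets of ranks."""
    return regular_correlation(ranks(x), ranks(y))
```

For objective scores `[1, 2, 3, 4]` and subjective scores `[1, 4, 9, 16]` the ordering agrees perfectly, so the rank order correlation is 1, while the regular correlation falls below 1 because the spacing of the scores differs.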

Below,  we  first  briefly  discuss  the  subjective 
speech  quality  rating  that  was  collected  in  a 
recent  study  [7],  and  then  present  highlights  of 
the  results  obtained  by  correlating  objective 
quality  data  against  this  subjective  data. 

Subjective Data

Stimuli for the subjective rating study [7]
were made by passing 7 sentences through each of 49
fixed-rate LPC vocoders. The transmission bit
rates for those vocoders ranged from 1267 bps to
8700 bps. The different vocoder systems were
obtained by varying, in a factorial manner, the LPC
order (13, 11, 9 and 8 poles), the LAR quantization
step size (0.5, 1 and 2 dB) and the transmission
frame rate (100, 67, 50 and 33 fps). The 49th
vocoder had 13 poles, 0.25 dB step size, and 100
fps frame rate. A 110 kbps PCM speech signal (11-bit
samples at 10 kHz), which was the input to the
vocoders, was also included to act as an undegraded
anchor. Nine subjects rated speech quality
degradation on a scale of 0-100. The rating scores
averaged over the nine subjects gave the subjective
data for our correlation study. To keep the
overall task manageable (in view of the large
variety of objective measures we were considering),
we chose a subset of 22 vocoders (all the 12
13-pole systems, and 5 each of 11-pole and 8-pole
systems) and 5 sentences given in Table 1.
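The vocoder counts quoted above follow from the factorial design, as the short enumeration below confirms. The tuple layout and variable names are illustrative only.

```python
from itertools import product

orders = (13, 11, 9, 8)            # LPC order, poles
step_sizes_db = (0.5, 1.0, 2.0)    # LAR quantization step size, dB
frame_rates = (100, 67, 50, 33)    # transmission frame rate, fps

# 4 x 3 x 4 factorial combinations, plus the extra 49th vocoder.
factorial = list(product(orders, step_sizes_db, frame_rates))
vocoders = factorial + [(13, 0.25, 100)]

# The 13-pole systems within the factorial design.
thirteen_pole = [v for v in factorial if v[0] == 13]
```

Here `len(vocoders)` is 49 and `len(thirteen_pole)` is 12, matching the 12 13-pole systems in the 22-vocoder subset.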

Correlation  Results 

We ran extensive correlation experiments for
several purposes: 1) to use the correlation scores
as a means of choosing the parameters of the
time-weighting and time-averaging schemes discussed
above; 2) to study the effect of incorporating into
an objective evaluation measure any one or group of
the different time-weighting and time-averaging
schemes; and, 3) to pick that subset of these
schemes which, for a given frame error measure,
maximizes the correlation scores on the average.
In short, correlation against subjective data
served as the primary vehicle for judging the
effectiveness of an objective quality measure.
Results obtained from these correlation studies are
numerous and complex due to the interactions
between the different time-weighting and averaging
elements as well as the profound effect of speech
material and speakers. Below, we provide important
highlights of these results. Since the vocoder
input speech used in this study had a 5 kHz
bandwidth, we employed a 14-th order anchor system
in computing the objective quality scores.

a)  Correlation scores obtained for male speech
(JB1, JB5 and DK6) were generally higher than those
for female speech (AR4 and RS6). (See further
below.)

b)  The energy weighting function W1 or W2 and the
PM-based implicit time-weighting method produced in
general higher correlations, although the two
methods did not always reinforce each other.

c)  By and large the averaging method E4 is
superior to E1 and E3.

d)  For the frame error measures d1-d3, we found
the appropriate time-weighting and averaging
methods so as to secure on the average the highest
correlation scores for the 5 utterances we
considered. The resulting objective quality
measures M1-M3 are described in Table 2, and their
correlation scores are given in Table 3. For both
M2 and M3, PM-based weighting and no (or unity)
energy weighting were chosen, while for M1 linear
energy weighting W1 and no PM-based weighting were
preferred. This may partly be due to the fact that
the automatic PM scheme already uses (linear)
energy weighting [6].

e)  The correlation scores given in Table 3 range
from 0.685 to 0.947, and are all highly
significant. Note that for 22 "measurements"
corresponding to 22 vocoders a significance level
of better than 0.001 is achieved with a correlation
score of only 0.652.

f)  The measure M2 based on the rms log spectral
error and the measure M3 based on the LAR error
were found to behave quite similarly. Since all
three quality measures produced good correlation
results, we recommend the use of M3 as it is the
least expensive of the three computationally.
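The significance threshold quoted in item (e) can be checked with the usual t-test for a correlation coefficient, t = r sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below assumes a two-tailed test and takes the critical value 3.850 for 20 degrees of freedom at the 0.001 level from standard Student's t tables; the report does not state which test was used, so this is one plausible reading rather than the authors' procedure.

```python
import math

# Assumption: two-tailed 0.001 critical value of Student's t, 20 d.o.f.,
# taken from standard statistical tables.
T_CRIT_20_DOF = 3.850

def t_statistic(r, n):
    """t statistic for a correlation r measured over n points."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def critical_r(t_crit, n):
    """Smallest correlation that reaches the given critical t value."""
    return t_crit / math.sqrt(t_crit ** 2 + (n - 2))

r_threshold = critical_r(T_CRIT_20_DOF, 22)   # about 0.652 for 22 vocoders
```

Under these assumptions the threshold for 22 vocoders comes out at about 0.652, and the lowest regular correlation in the reported range, 0.685, gives a t statistic of about 4.2, above the critical value.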
As mentioned above, and as shown in Table 3,
all the reported objective measures yielded
generally lower correlation scores for female
speech than for male speech. Also, in choosing the
components of the objective measures M1-M3 given in
Table 2 we did not make use of the correlation
scores computed for AR4 and RS6, since the scores
varied relatively spuriously and did not indicate
any clear choice. One reason for these problems is
that the 22 vocoders considered in our correlation
study drew in general more clustered subjective
ratings for AR4 and RS6 than for JB1, JB5 and DK6.
A second reason (somewhat related to the first) is
that subjective rating scores for the utterances
from female speakers were relatively constant over
the range of the LPC order considered (8-13 poles);
in contrast, the rating scores for male speakers
exhibited a wide range of variation. Also, the
subjective rating data for female speech had
several examples where a vocoder with a lower order
drew a better rating than another with a higher
order, with the quantization step size and frame
rate being the same for the two vocoders. These
considerations suggest that the order of the anchor
system may be varied as a function of the average
fundamental frequency of the speaker over the whole
utterance. By choosing the anchor system's order
as 12 poles for AR4 and 10 poles for RS6, we
obtained definite increases in the correlation
scores, although the scores still remained
substantially lower (especially for RS6) than those
obtained for male speech.


Utterance   F0 (Hz)   Sentence

JB1         119       Why were you away a year, Roy?
AR4         165       Which tea-party did Baker go to?
JB5         124       The little blankets lay around on the floor
DK6          97       The trouble with swimming
RS6         193       is that you can drown.

Table 1.  The five stimulus sentences, with the
speaker's average fundamental frequency in Hz.


Quality   Frame Error   Energy       PM-based    Average
Measure   Measure       Weighting    Weighting

M1        d1            Linear, W1   No          E4
M2        d2            None         Yes         E3
M3        d3            None         Yes         E3

Table 2.  Description of 3 objective speech
quality measures.

Table 3.  Correlation scores obtained for 3
objective measures. Values in parentheses
are rank order correlation scores.

Objective Speech Quality Measure: M1, M2, M3

.916(.699)
.929(.928)
.947(.848)

.927(.808)
.868(.823)
.887(.825)

.905(.872)
.920(.899)
.909(.684)

.853(.807)

.716(.718)

.885(.866)

.812(.817)

.928(.879)

.685(.653)

